Text File | 1996-08-06 | 179.6 KB | 3,614 lines | [TEXT/KEEN]
********************* hAWK User’s Manual *********************
Copyright © 1991 the Free Software Foundation, Inc. You can redistribute or modify
this file under the terms of the GNU General Public License as published by
the Free Software Foundation (see the file “COPYING hAWK”).
font: Geneva 10. Four spaces per tab.
hAWK is NOT a stand–alone application: it must be called by some other application.
Interaction between hAWK and the calling application will vary according to how
well the calling application supports text documents. However, virtually any
(C-based) application can add the ability to call hAWK. For details, see
“Calling hAWK from your application” near the end of this manual.
Applications which support calling hAWK (add yours to the list!):
Minimal App (included, with source code)
EnterAct, RFEdit, EnterAct Lite
You can read this document with any programmer’s editor (you may not see the 4
pictures - they’re not that critical). You’ll need an editor to view the results of a
program run if you use Minimal App to call hAWK, since Minimal App does not do
anything with text files, and you’ll find that Minimal App, with its minimal level of
support, has limited program input options. In fact, calling hAWK through Minimal
App shows what hAWK would look like if it were repackaged as a stand–alone
application. See “Calling hAWK through Minimal App” (in the “Advanced topics”
chapter) for tips on using Minimal App with an editor to run hAWK programs.
Major topics are marked with MPW-compatible marks, available in many editors by
holding down the <Option> or <Command> key while clicking in the window’s title bar.
You can jump to a section heading by selecting the heading in the table of contents and
using the editor’s “Enter Selection”/“Find Again” commands. The “Active index” at
the end of this manual is suitable for on-line use, consisting of line numbers rather
than page numbers; to jump to the line for a reference in the index, select the
corresponding line number and use the editor’s “Go to” command.
If you change the content of this manual you will throw off the Active index, and will
lose the marker locations also if the editor doesn’t manage MPW–compatible marks.
However, feel free to add or delete markers, or change the font.
Why bother to learn hAWK?
• Many editing and formatting problems that crop up in the life of a C programmer
can be solved with a simple hAWK program. Now you have a choice—grind out a
series of mechanically–repeated key strokes, or dash off an elegant little program.
And when it comes time to solve a problem, a typical hAWK program can be run
with two mouse picks and a press of the <Return> key (or even a command line).
• On the Mac alone, there are versions of AWK that run under the MPW shell, under
A/UX, and now with hAWK there is a version that’s handy to use in conjunction with
THINK C. Never mind all the DOS and Unix implementations: even on the Mac, AWK
is a widely–used language. You’re not learning a white elephant, here.
• Need to prototype a “little” language? Try out an algorithm? Looking for an
introduction to C that comes with air bags? This is it. For a sampling of what hAWK
can do, see “About the supplied programs” below.
Contents
-----------
Introduction
Installing hAWK
Where to go from here
About hAWK
From AWK to gAWK to hAWK
What’s missing
What’s new
The calling application
A typical hAWK run
Running hAWK programs
The setup dialog
Concurrent and immediate modes
Selecting your program
Selecting input for a program
Setting variables
Library files
Showing the results
Saving the setup for a program
Cancelling a run
Standard input and output
About the supplied programs
hAWK program structure
From start to finish
Grouping and breaking lines
The command line and ARGV[]
Variables and constants
Variable names and types
Constants
Record and field variables
Built–in variables
Local variables in functions
Setting variables on the command line
Conversion between numbers and strings
Arrays
Patterns
Patterns and actions
BEGIN and END
Expressions as patterns
String-matching patterns
Regular expressions
Compound patterns
Range patterns
Summary of patterns
Actions
Introduction
A preview of “print”
Expression operators
Built–in numeric functions
Built–in string and file functions
Control-flow statements
Empty statements
User-defined functions
Output
The “print” statement
The “printf” statement
Output into files
Closing files
Input
FS, the input field separator
RS, the input record separator
The “getline” function
The “hAWK” function
Advanced topics
Other ways of specifying input files
Beyond input records
Calling hAWK through Minimal App
Calling hAWK from your application
What and how
Getting started
Add two calls in your code
A minimal version
Callbacks, and showing results
Using a command line
Modifying hAWK
Introduction
hAWK THINK C project
Source
Libraries
Active index
-------------
Introduction
-------------
hAWK is AWK adapted for the Macintosh, a small programming language which is
well-suited to jobs involving text manipulation and pattern recognition. hAWK
is not a stand-alone application, but is rather a CODE resource with a specific simple
calling interface (called a "Drag_on Module"), and it is invoked by selecting "hAWK"
from a menu in an application that can call Drag_on Modules.
This manual will explain in more detail what hAWK is, and show you how to run hAWK
programs. There are many useful programs supplied in the "hAWK programs" folder,
each with complete instructions at the top so you can try them out as you go along; they
range from very simple to rather complex, general purpose to very special purpose,
and illustrate the wide range of hAWK’s abilities, from counting lines in a file to
cross–referencing your C source. The chapter below entitled “About the supplied
programs” provides an overview of the programs in the “hAWK programs” folder.
These programs are not just useful as “examples to learn from”—they are, for the
most part, nontrivial, and supply real answers to the daily problems of a C
programmer.
What is hAWK really? hAWK is what C could be if you weren't in a hurry. hAWK
programs are relatively small, look rather like C code, and rely on powerful built-in
capabilities and commands—capabilities like automatic reading of input files on a
line-by-line basis, commands such as "gsub" which is, just on its own, as powerful
as Grep. The focus is on text, but the text can be just about anything—the sample
program “$Print_MENU_Resource”, for example, will take the hex representation
of a MENU resource as retrieved by Read Resource and format it to be human–readable.
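As a small taste of those two capabilities, here is a sketch that relies only on
standard AWK features (the automatic line-by-line reading of input, and “gsub”).
It is not one of the supplied programs, just an illustration: it squeezes every run
of spaces and tabs in each input line down to a single space and prints the result.

```awk
# This action runs once for every input line; no explicit read loop is needed.
{
    gsub(/[ \t]+/, " ")     # replace each run of spaces/tabs in the line with one space
    print                   # print the modified line
}
```

Note that there is no file handling and no loop in the program itself; the language
supplies both.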
The primary difference between hAWK and other versions of AWK lies in the method of
running programs; hAWK’s setup dialog allows you to run programs with just a few
mouse clicks, with typing needed only if you wish to assign initial values to variables
before a run. This is mainly because hAWK can take advantage of the window and file
handling abilities of the application that is used to call it, to offer the options of taking
input for the hAWK program from text in the front window of the calling application,
or from the list of files selected for multi–file operations. These generalised input
specifications, “whatever’s in the front window” and “whatever’s selected for
multi–file operations”, eliminate the need to type in a list of file names for a program
to use as input. And since each program can remember the general input method you
have selected for it, repeated runs of a program are reduced to: bringing the input to
hAWK’s attention, either by bringing a text file to the front or by selecting files for
multi–file searching; and then running the program with three mouse clicks. This all
makes hAWK as easy to run as a macro language, and since AWK is a widely–used,
full–featured programming language you should find it well worth the effort of
learning.
Although running hAWK with the setup dialog is normally the easiest way, hAWK
can also be called with an old-fashioned command line. You can’t pass any frontmost
text to hAWK this way (since the frontmost text will be the command line), but
you can typically specify all files selected for multi-file operations as input, or
specify one or more input files using full path names. If you want to implement an
application that supports calling hAWK via a command line, please see the section
"Using a command line" in the "Calling hAWK from your application" chapter below.
If you just want to use the command line approach, see the documentation supplied
with your application that calls hAWK for the details on how to do it.
---------------
Installing hAWK
---------------
If you can read this, then you’ve installed hAWK, since it is being shipped in
compressed form these days. As a reminder, hAWK should be inside your
"Drag_on Modules" folder, and this folder should be in the same folder that
contains the calling application, at the same level. The "hAWK programs" folder
should also be in the "Drag_on Modules" folder, and this manual can go anywhere.
To verify that hAWK has been installed, start up an application that can call hAWK
and then check the menus; you should see “hAWK” as one of the items. Select “hAWK”,
and the setup dialog for hAWK will appear. Venture on ahead fearlessly if you like,
armed with the magic incantation that holding down the <Command> key while typing
a <period> will interrupt any running hAWK program.
------------------
Where to go from here
------------------
Read straight ahead here until you’ve tried out a few hAWK programs and are comfortable
with the overall approach to running them. The supplied programs in the “hAWK
programs” folder are worth exploring to get a feel for what hAWK can do—and you’ll
likely find that several of them provide answers to problems little or big that you
regularly face. The remainder of this manual delves into the inner workings of hAWK,
necessary reading if you want to write your own hAWK programs (and who could
resist?). If you make use of the markers in this manual for the chapter and
section headings, and the active index at the end listing topics, you’ll be able to browse
around almost as easily as with a printed book.
This is a good–sized manual, and if you try to read straight through it at one sitting
you’ll probably hurt your head. Just amble along at a gentle pace, and when ideas or
questions pop up, you’ll find it well worth the effort if you take a moment to write
a one or two–line hAWK program to try the notion out. Running a hAWK program
takes just a few mouse clicks. The easiest way is with “$RunClip” (see chapter H).
You can, if you wish, print this manual yourself. Aha, but what about that index, which
lists line numbers rather than page numbers? Thought you might ask that—what you
want, then, is a version of this manual with line numbers added at the beginning of each
line. An ideal job for hAWK!
1 Use a “Save As” command to save this manual under a different name, such
as “hAWK Manual” (or save it under the same name but in a different folder):
2 Select “hAWK” from the calling application’s menu, and the setup dialog will appear;
select “$AddLineNumbers” from the “Main program:” popup menu at the top; pick the
“Select input file” option from the “Take input from:” popup, and use the standard
Open dialog that appears to select the copy of this manual that you just created:
2a Click “Run” and wait a bit....and you’re back in the calling application:
3 Open the copy of this manual—if you left it on–screen while running hAWK,
choose “Revert” to see the changed version (you can force Revert to be enabled by
typing one character in the window):
4 Print the result —change the font first, if you like.
5 Note, to include the pictures you will have to use ResEdit to copy them from the
original manual to your copy of the manual, and use EnterAct to print. They deal with
the setup dialog only, and you shouldn’t miss them much if you don’t bother.
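For the curious, a line-numbering program needs very little code. The following is
only a sketch in standard AWK, not the text of the supplied “$AddLineNumbers”
program. The five-column width for the number is an arbitrary choice made for this
illustration, and the sketch simply writes the numbered lines to standard output
rather than changing the input file.

```awk
# Prefix every input line with its line number, right-justified in 5 columns.
# NR is the built-in count of input records (lines) read so far.
{ printf "%5d  %s\n", NR, $0 }
```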
A very readable description of AWK (excluding the Macintosh variations of hAWK) can
be found in
"The AWK Programming Language" ,
Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger,
Addison-Wesley, 1988. ISBN 0-201-07981-X.
on the "Languages" or "unix" wall of your favourite bookstore.
A more relaxed, though less ambitious, introduction can be found in
"sed & awk"
Dale Dougherty
O’Reilly & Associates, Inc., 1991. ISBN 0-937175-59-5.
The coverage of regular expressions is especially sympathetic.
----------
About hAWK
----------
From AWK to gAWK to hAWK
hAWK is a Macintosh version of AWK, a pattern-recognition and data-manipulation
language that is popular on Unix systems. This version of hAWK is a modification of
GAWK, the GNU Project's implementation of the AWK programming language, which
differs in only minor ways from "classic" AWK. "hAWK" will be the name used
below, except where differences from GAWK or AWK need pointing out.
AWK has a venerable history, going all the way back to 1977 when Messrs. Aho,
Weinberger, and Kernighan developed it at Bell Labs to fill in some small holes
in Unix. The idea then was to write one or two–line programs to solve simple
pattern–matching and text or number transforming problems—programs so small
that you wouldn’t even bother to save them, just type them in on the fly, right on
the command line. Over the years, users have pushed the limits of AWK, and many
features have been added (user–definable functions being the nicest), and now
multiple–page AWK programs are commonplace.
GAWK is a Unix/IBM version of AWK, developed around 1986 by Paul Rubin and Jay
Fenlason and copyright by the Free Software Foundation. It adds some useful
enhancements to AWK, dealing mostly with files and variables.
hAWK is essentially GAWK adjusted for the Macintosh, with the addition of a dialog
interface to take advantage of windows and mice. If you wish to distribute hAWK, by
the way, you should note that it is governed by the Free Software Foundation’s
copyright restrictions (not too horrible) which you can find in the file “COPYING
hAWK” in the source code folder for hAWK.
What’s missing
Pipes are missing. Pipes take a full–fledged shell to run, and most applications aren’t
up to it. Since hAWK is packaged as a CODE resource to be called by any old application,
pipes had to go. Similarly, the “system” command (which allows one to call other shell
commands from within an AWK program) has been dropped.
What’s new
The interface is new. No more command line—most hAWK programs can be run with
just a few mouse clicks, and typing is needed only if you want to set the value of
variables before running the program. (Note a command line is supported though.)
There are seven new built–in string functions, “lookup”, “sort”, “time”, “prompt”,
“progress”, “getclip”, and “putclip”, described in “Built–in string and file functions” in the
“Actions” chapter. Some new file and directory functions are also described there.
The “lookup” function returns the type of a C term as an integer code (#define = 1,
variable = 2, etc), useful when doing cross-referencing. It relies on the calling
application for this diagnosis, so hAWK programs that use “lookup” should be called
only through applications which support it (Minimal App doesn’t).
The “sort” function is provided to (mostly) make up for the lack of a shell sorting
function. It’s fast, and can do ASCII, numeric, or dictionary–order sorting of an array,
in forward or reverse order.
The “time” function produces the current date and time, to the second.
The “prompt” function prompts you with a dialog to enter some text, and returns what
you enter as a string, as in
X = prompt("Please enter a value for X:")
The “progress” function allows you to show (and update) a message while a program is
running.
The “getclip” function returns a string holding the calling application’s current (up to
the second) private clipboard. This can be used to pass instructions or data to a hAWK
function while it is running concurrently with your application (more on this,
needless to say, below). Similarly, putclip puts a new string of text on the clip.
As a partial replacement for the “system” command, any hAWK program can call any
other hAWK program as a “subroutine”, via the “hAWK()” function. Using this
function, a program can generate a special-purpose program and immediately
execute it (eg $MFS_SuperReplace), or selectively execute a series of programs (eg
$Chain). It also allows you to type in and run programs without saving them first (eg
$RunClip). This function is described in its own chapter, “The hAWK function”.
Three built–in variables have been added: RUNERR, STDPATH, and TIME. See “Built-in
variables” in the “Variables and constants” chapter for details.
hAWK uses the concept of standard input, output, and error, but strictly in the
form of files with the fixed names $tempStdIn, $tempStdOut, and $tempStdErr.
These files are created and written to as needed, and can be found in the same
folder that contains your “Drag_on Modules” folder after you’ve begun running
hAWK programs. These are temporary files, and will normally be overwritten
by each hAWK program run.
The regular expressions implemented in hAWK are full regular expressions, with the
ability to tag subexpressions, match word boundaries, ignore case, and deal with
multi–line strings. Just about anywhere else in this world, you’ll find either full
regular expressions or the ability to tag subexpressions, but not both. One minute you
want the “or” operator, the next minute you want to tag something—it gets rather
frustrating. There is absolutely no good reason not to allow both together, so in hAWK
you’ve got them. Speaking of gripes, most Greps will limit you to a single line; that’s
not just frustrating, it’s downright crippling. (By the way, another major
improvement over Grep is that in AWK/hAWK your regular expression can be the
string resulting from the evaluation of one or more variables, eg
    if (no_plus_or_minus)
        integer_pattern = digits;                  # digits == "[0123456789]+"
    else
        integer_pattern = plus_or_minus digits;    # plus_or_minus == "[+-]?"
—and a pleasant side–effect is that regular expressions can be very readable if you want.)
For the details, see “Regular expressions” in the “Patterns” chapter.
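To see a pattern built from variables in action, here is a self-contained sketch in
standard AWK. The variable names follow the fragment above; the “^” and “$” anchors
and the list of test words are additions made for this illustration, so that each
word must match the pattern as a whole.

```awk
BEGIN {
    digits          = "[0123456789]+"
    plus_or_minus   = "[+-]?"
    # Concatenate the pieces into one pattern string, anchored at both ends.
    integer_pattern = "^" plus_or_minus digits "$"

    n = split("42 -7 +19 3.5 x1", words, " ")
    for (i = 1; i <= n; i++) {
        if (words[i] ~ integer_pattern)     # the string is used as a regular expression
            verdict = "integer"
        else
            verdict = "not an integer"
        print words[i] ": " verdict
    }
}
```

The first three words are reported as integers and the last two are not. Because the
program does all its work in BEGIN, it needs no input.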
If the calling application supports the notion, your hAWK programs will by default
run concurrently with your calling app. This means you start up the hAWK program,
and then go back to working in your application (or background it and work somewhere
else) until the hAWK program is done. The “prompt” and “progress” functions are
non-functional in this concurrent mode. To use them, run the program in “immediate”
mode instead, by holding down the <Shift> key while selecting “hAWK” from the
calling application’s menu. In
immediate mode, you will be locked out of the calling application until the hAWK
program ends. Programs will run more slowly in concurrent mode (the speed
drop being slightly greater if you put the calling application in the background),
but this is usually more than compensated for by being able to carry on with other
things, rather than just sit there watching the watch cursor. The running hAWK program
usually doesn’t affect application performance very much. For more about this,
see “Concurrent and immediate modes” in the “Running hAWK programs” chapter.
The calling application
Any C-based application can call hAWK and other Drag_on Modules, as the source
code for Minimal App demonstrates. The level of interaction between hAWK and
the calling application is up to the author of the calling application, and can vary
more or less according to the following table:
Level         Support for interactive features
----------    --------------------------------
minimal       (none; no result showing, input options limited to one specific file)
basic text    pass front text window as input option, show stdout after a run
full text     basic text, and pass list of selected files as input option
full          full text, diagnose the type of a C code term, pass the clipboard
If the application you are using provides only minimal support, then some extra
manual steps are needed to persuade a hAWK program to take input from the current
front text file or a list of files, and to view the results of a run; see “Calling hAWK
through Minimal App” in the “Advanced topics” chapter for some tips on this. The
discussion there is “advanced” only if you want to understand all the details—you
can use the methods described there by rote (for example, if it says paste this bit
of code into the top of a program and you’ll have support for taking input from a list
of files, you can do it now and worry about how it works later).
------------------
A typical hAWK run
------------------
Have you installed hAWK yet? If not, now would be a good time (see above).
We’ll assume that you’re calling hAWK through an application that supports passing all
or part of the front text window as input options, and showing stdout after a run, to
make life simpler. If you don’t have such an application, you can use Minimal App in
conjunction with whatever editor you are using to view this file, as described in the
“Advanced topics” section “Calling hAWK through Minimal App”.
One of the programs supplied with hAWK is “$EnumSwitch”, which takes a list of
enum constants and generates a “switch” statement based on them. It’s contained in the
folder “hAWK programs”, which is inside the “Drag_on Modules” folder—you might
like to take a look at it first....
OK, here we go: first, move this window on your screen so that you can see the next few
lines while the hAWK setup dialog is in front (select hAWK now from the appropriate
menu and Cancel to see where it appears). Now select the following line of text:
{first, second, third, fourth, twilightZone = -99}
-is it highlighted? Good. Now, select hAWK from the menu; when the dialog
appears, select “$EnumSwitch” from the top popup menu called “Main program:”, and
finally, click on the Run button or hit <Return> on the keyboard.
You should be back in the calling application now, with a switch statement coming up in a
window called “$tempStdOut”. hAWK took the line that you highlighted above, stripped
it down, built a switch statement out of the words, and wrote the results to the disk file
“$tempStdOut”. The calling application is now showing you the resulting file, with
contents selected and ready for pasting into your source code.
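To give a feel for how little code such a job takes, here is a rough sketch of the
same idea in standard AWK. It is NOT the supplied “$EnumSwitch” program, and the
layout of the generated switch statement, including the placeholder name “value”,
is invented for this illustration.

```awk
# For each input line holding a brace-enclosed, comma-separated list of enum
# constants, print a skeleton switch statement with one case per constant.
{
    line = $0
    gsub(/[{}]/, "", line)              # strip the braces
    n = split(line, names, ",")         # one array element per constant

    print "switch (value)"              # "value" is a placeholder name
    print "    {"
    for (i = 1; i <= n; i++) {
        name = names[i]
        sub(/[=].*/, "", name)          # drop any "= number" initialiser
        gsub(/[ \t]+/, "", name)        # drop surrounding blanks
        if (name != "")
            print "    case " name ":\n        break;"
    }
    print "    default:\n        break;"
    print "    }"
}
```

Given the highlighted line above as input, the sketch prints a case for each of
first, second, third, fourth, and twilightZone, followed by a default case.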
Most hAWK programs can be run this easily. Now, the full story.
------------------
Running hAWK programs
------------------
The setup dialog
When you select hAWK, the above “setup” dialog always appears first. A typical
program run consists of: setting up the input to be ready for hAWK; selecting hAWK to
see the setup dialog; selecting the program to run from the “Main program” popup
menu; and hitting the Run button. If you have variables in the program that need
to be set just before running the program, then you can set up to 10 variables by
using the dialog that appears when you click the “Set variables” button. The input
option, variable settings, and names of any associated libraries can all be saved with
a hAWK program via the “Save settings” button, so that when you run a program again
you’ll need to adjust the setup only for things that have changed (typically only the
values to be initially assigned to variables, if anything).
Concurrent and immediate modes
With most little languages, when you run a program that’s all you do—run the
program. No continuing to work in your primary application, let alone switching to
another application. In the rare case when you want hAWK to completely take over your
Macintosh, locking you out of the calling application, hold down the <Shift> or <Option>
key while selecting “hAWK” from the calling application’s menus. If the program uses
the “prompt” or “progress” functions, it will be necessary to run in this
“immediate” mode, since they just return null results in the “concurrent” mode.
In all other cases, just select “hAWK” from the calling application’s menus
without holding down the <Shift> or <Option> key, and if the calling application
supports it, you’ll be returned almost immediately to your application, able
to carry on working there while the hAWK program runs at the
same time. This “concurrent” mode of running programs does not greatly
slow down the calling application or any other application that you switch to.
The hAWK program itself will run more slowly than in immediate mode, often
taking about 50% longer—but if you don’t need the results in a huge rush,
stick to the concurrent mode and just forget about the hAWK program until
it winds up with a beep.
While a hAWK program is running concurrently, you won’t be able to run any
additional Drag_on Modules. This is because they all use the same standard output
file ($tempStdOut), and a fight could develop over who gets to write to it.
While a hAWK program is running concurrently, you will not be able to save to
any files that hAWK is using. Regular input files are accessed only one at a time,
and the standard input/output/error files will normally be “busy” from
beginning to end of the run. In addition, any files being read from or written to
via redirection (see “Output” and “Input” chapters) will not be writeable.
However, you will be able to open any file that hAWK is using to take a look
at it. With a lengthy program, you can check in with hAWK now and then by
opening (or reverting) $tempStdOut to get a snapshot of how things are
progressing.
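Since “progress” returns nothing useful in concurrent mode, one simple way to make
those snapshots informative is to have the program print an occasional status line
itself. The sketch below uses only standard AWK; the interval of 500 lines and the
wording of the messages are arbitrary choices for this illustration, and how soon a
printed line becomes visible in $tempStdOut is assumed rather than guaranteed.

```awk
# Every 500 input lines, write a status line to standard output.
NR % 500 == 0 { print "...processed " NR " lines so far" }

# When all input has been read, report the total.
END           { print "done: " NR " lines in total" }
```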
See the supplied program “$LogDaemon” for an example of a hAWK program which
idles unobtrusively underneath your calling application, waiting to take special
action when you copy a specific instruction to the application’s clipboard. A
“daemon”, by the way, is an invisible, powerful spirit with your best interests
at heart. It “possesses” your Macintosh, in a nice way. And the name is a bit more
entertaining than the plain old “forks” and “threads” etc.
Concurrent execution is currently supported by: EnterAct.
Selecting your program
The “Main program:” popup at the top of the setup dialog lists all text files in the
“hAWK programs” folder whose names begin with a dollar sign ($). This list is
rebuilt each time you call hAWK. If a program is not listed in the popup, you can still
run it by picking “Select unlisted program”, the first item in the “Main program”
popup, and then using the standard Open dialog that appears to select the program—note
it could be in another folder, or in the “hAWK programs” folder but not shown in the
popup simply because its name doesn’t start with a “$”. You can avoid clutter in this
popup by starting the names of only your most popular hAWK programs with a “$”, so
that other less–frequently used programs won’t be shown in the popup—if they are in
the hAWK programs folder, they will still be close at hand.
Selecting input for a program
This is one of hAWK’s nicest features, allowing hAWK to interact with the calling
application to provide quick input file specification. Two additional ways of specifying
input files, not listed in the “Take input from” popup, are described in the
“Advanced topics” chapter, in “Other ways of specifying input files”.
Under the “Take input from” popup menu, the options “Front text selection” and
“All of front text” refer to the text window that happens to be in front just before you
call hAWK from the calling app’s menu. According to what you select here, all or just the
selected part of the text in the front window will be written to a temporary file called
“$tempStdIn”, and passed to your program as the input file to use. If your program
is to be run using one of these options, bring the text window containing the text to be used
as input to the front just before calling hAWK, and if you’ll be using the “Front text
selection” option, you should select the text as well. For an example, see
“A typical hAWK run” above, where this manual itself served as the front text.
The “MFS selected files” option in the “Take input from” popup refers to a list of files
selected in the calling application for multi–file operations (typically this list is used
mainly for multi–file searching in the calling application, and you construct it by
placing check marks or bullets • beside file names—see the calling app’s manual for
details). With this option selected, all files selected for multi–file operations will be
passed to the hAWK program as input. This means you can set up a list of files in the
calling app, and then have your hAWK program take its input from those files, from
one file to hundreds. One limitation of this approach is that you can’t specify the exact
sequence in which the files will be dealt with. With many programs, this is not a
problem (multi–file search and replace, for example). To treat input files in a specific
order see “Other ways of specifying input files” in the “Advanced topics” chapter.
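As an example of a program that does not care about file order, here is a sketch in
standard AWK that counts the lines in each input file. It assumes that hAWK sets the
standard built-in variable FILENAME for each file in the list, as other versions of
AWK do.

```awk
# Tally lines under the name of the file they came from.
{ lines[FILENAME]++ }

# Report one total per file; "for (name in lines)" visits the files in no
# particular order, which is fine for a simple tally.
END {
    for (name in lines)
        printf "%6d  %s\n", lines[name], name
}
```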
The “Select input file…” option allows you to use a standard Open dialog to pick one
specific file to use as input for a hAWK program. As with all other aspects of the
setup dialog, if you click “Save settings” the name of the file you select will be saved
with the program itself, and restored for the next run.
Aside from “Select input file…”, input options will not be shown if they are not
currently available.
In rare cases, you may need no input at all for your program. To ensure that no
input is passed, pick the “Select input file…” input option, cancel the Open dialog
that appears, and then click the “Save settings” button. The input option for your
program will thereafter read “Select input file…”, as though imploring you to
pick one, but no input will be sent to your program. It’s harmless if input is sent
to a program that doesn’t want any, the only penalty being time lost if a massive
amount of input is accidentally ordered along for the ride.
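A program that needs no input typically does all of its work in a BEGIN action. The
following sketch prints a small table of squares. In standard AWK a program
consisting only of BEGIN never reads any input; that this also holds for hAWK is an
assumption here.

```awk
# All work happens in BEGIN, so no input records are needed.
BEGIN {
    for (i = 1; i <= 10; i++)
        printf "%2d squared is %3d\n", i, i * i
}
```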
Setting variables
The “Set variables” button allows you to preset the values of variables just before
running a program, without having to edit the program itself. As you can see from the
picture, it’s a simple matter of typing the variable name, followed by an “equals”
sign, followed by the value of the variable, either a number or a string.
Quotes should not be used to surround strings; just enter the string itself. Any
spaces between the “=” and the value will count as part of the value, so normally
you should enter the value with no spaces between the equals sign and the value.
For example,
find =spot
and
find = spot
produce different results. Spaces are optional between the name of the variable
and the equals sign.
The limit on the length of the variable assignment, including the name of the
variable, is 100 characters. Up to 10 variables may be given values this way.
Special characters such as tabs and returns can be placed in a string by using the
standard escape sequences familiar from C, eg
find =\tspot\n
assigns to “find” the string consisting of a tab, followed by s-p-o-t, followed
by a carriage return.
You can also assign the value of a (dynamic) regular expression using the “Set
variables” dialog, for example
find =\.#[A-Za-z_]+ (never mind what it means for now)
—note there is no need to enclose it in forward slashes, and many characters must
be escaped with a backslash if you want them matched literally (the section
“Regular expressions” in the “Patterns” chapter explains the nuances).
Clicking the “Save settings” button will save your variable assignments for
subsequent runs. Hence you’ll need to use the “Set variables” dialog only
when the preset value of some variable changes.
If variable presets exist for a program then the “Set variables” button will
acquire a gray outline as a reminder that some variables may need changing
before running the program. With some programs (such as $CompareFiles) you’ll
almost never change the preset variables, but with others (such as $MFS_SuperLister)
you’ll want to change one or more variables before almost every run.
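As a sketch of how a preset might be used, the short program below expects a variable
named “find” to have been given a value with “Set variables” (the name is simply the one
used in the examples above; nothing requires it). It shows the two readings of the value:
as plain text, using index(), and as a dynamic regular expression, using the ~ operator.
Both are standard AWK and are assumed here to behave the same way in hAWK.

```awk
# "find" is never assigned here: it is expected to arrive as a preset,
# for example   find =spot   in the "Set variables" dialog.

# Treat the value as plain text: index() returns 0 when the text is absent.
find != "" && index($0, find) > 0 {
    print "text   " FNR ": " $0
}

# Treat the same value as a dynamic regular expression.
find != "" && $0 ~ find {
    print "regexp " FNR ": " $0
}
```

With a value such as “spot” both blocks report the same lines; with a value that
contains regular expression characters the two reports can differ.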
Library files
Technically, this is an advanced topic, but it’s simple to use. If you develop
some general–purpose functions, such as sorting routines, that you wish to
use in several programs without duplicating the function definitions within
each program, you can save the functions in a separate file and add that file
to each main program as a library. The contents of the library file are
simply appended to the contents of the main program before running it, so the
library can in fact contain any valid hAWK statements. However, to preserve
sanity, libraries should be restricted to just functions.
To add a library file to a main program:
1 Use the “Main program” popup to select the program
2 Use the “Select library…” item in the “Libraries:” popup to add the library
by using the standard Open dialog that appears.
3 Clicking the “Save settings” button will preserve your selection of libraries
for subsequent runs.
To delete a library, select it using the “Libraries:” popup.
One sample library is included, in the file “SortLibrary”. It is not used in the
sample programs, it’s just an example (PLEASE NOTE hAWK has its own
built–in sort function, which is very fast). Little is lost if you follow the
policy of not using libraries—programs are easier to read if all the code is in one place.
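For illustration only, here is the kind of split described above. The function name
trim() and the idea of keeping it in its own file are assumptions made for this sketch;
it is not the supplied “SortLibrary”. Because the library text is simply appended to
the main program, the two pieces are shown one after the other, as hAWK would see them.

```awk
# ---- main program ----
# Print every non-blank input line with leading and trailing blanks removed.
{
    line = trim($0)
    if (line != "")
        print line
}

# ---- library file (hypothetical), appended after the main program ----
# Returns s without leading or trailing blanks and tabs.
function trim(s)
{
    sub(/^[ \t]+/, "", s)
    sub(/[ \t]+$/, "", s)
    return s
}
```

A function may be called before the point where it is defined, which is why
appending the library at the end of the main program works.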
Showing the results
Output from hAWK programs is produced by “print” or “printf” statements,
which send their output to the file “$tempStdOut” unless you explicitly
redirect it. For example,
print "some text"
will print the string "some text" to $tempStdOut.
The file $tempStdOut is created and managed for you, and most hAWK programs will
send at least some output to this file. If you would like to see this file after the program
is finished, put a check mark in the “Show stdout” checkbox in the setup dialog just
before running the program. When the program is done, the calling application will
then show you the $tempStdOut file in a window, if it is able to. If the calling
application doesn’t support showing stdout, you’ll have to manually Open or Revert the
$tempStdOut file using your editor (for more on this see “Calling hAWK through
Minimal App” in the “Advanced topics” chapter).
Place a check mark in the “Select all of stdout” check box to have all of the output
in the $tempStdOut window selected at the end of the program run. This is handy
if you’ll be wanting to copy the entire output and paste it in elsewhere.
Saving the setup for a program
The “Save settings” button saves away your selection of options for a program, so that
they will be restored for subsequent runs of the program. These options are saved with
the program itself, in a special resource. The saved options are:
1 The names of any libraries associated with the program
2 Names and values of any preset variables
3 Your choice of input option, including the input file name if you have used
the “Select input file…” option to pick a specific file.
4 Your output options, in the checkboxes “Show stdout” and “Select all of stdout”.
During the first run of a program that you have written, you should set up the options
you want and then click the “Save settings” button. Subsequent runs will then consist
of just these steps:
1 Select “hAWK” from the calling application’s menu
2 Use the “Main program” popup to select the program
3 Use the “Set variables” button if needed to put in new values for variables
(many hAWK programs don’t need this)
4 Click the Run button.
Occasionally, you may want to run a program using a different input option, for example
run it using “MFS selected files” rather than “All of front text”. This is simply a
matter of selecting the new input option from the “Take input from” popup just before
running the program. If you want the input option to be permanently changed for the
program, click the “Save settings” button after picking the new input option.
Cancelling a run
To cancel a hAWK program, hold down the <Command> key while typing a <period>.
Program execution should cease within one second.
------------------
Standard input and output
------------------
Drag_on Modules such as hAWK and Read Resource use three disk files to communicate
with you and with the calling application. These text files carry the burden of standard
input/output for Drag_on Modules. If a Drag_on Module requires a large chunk of input
that is not already in an appropriate disk file, the input will be written to the standard
input file “$tempStdIn”, and all normal output from Drag_on Modules is, unless you
specify otherwise, sent to the file “$tempStdOut”. If errors pop up while the Drag_on
Module is running, error messages will be written to the file “$tempStdErr”. These
files are all created and written to automatically as needed, and can be found in the same
folder that contains your “Drag_on Modules” folder.
The file of main interest here is $tempStdOut, which typically holds the results of a
Drag_on Module run. Drag_on Modules don’t show you this file, but can request that the
calling application show it to you. This is always the case with Read Resource, and is
optional with hAWK—it depends on whether you put a check in the “Show stdout”
checkbox in the setup dialog. All of the supplied hAWK programs that write output to
$tempStdOut have saved settings that include putting a check in this box.
Because the results of Drag_on Module runs are by default written to a fixed text
file, you can easily pass the output from one run to the input of another run. For
example, Read Resource creates a formatted text version of a resource and writes
the results to $tempStdOut, which is then shown to you by the calling application.
You can then call a hAWK program to further process this output, by leaving
the $tempStdOut window in front and having the hAWK program take its input
from the front window (pick the “All of front text” option from the “Take input
from:” popup menu). And you can pass the output from one hAWK program to
the input of another in the same way.
A Drag_on Module can only request that the calling application show you the
$tempStdOut file, but whether or not it does so is up to the author of the calling
application. If it doesn’t, you’ll have to Open or Revert $tempStdOut yourself
in order to see the results.
The contents of $tempStdOut are indeed temporary, and will be overwritten by the
next hAWK program, or indeed any other Drag_on Module, that you run. If you want
a permanent copy of the output from a program, use “Save As” to save $tempStdOut
under a new name, or copy the contents to a working window.
hAWK always takes input from a file, and if you are using one of the “front text”
options for input then hAWK will write a copy of the front text to $tempStdIn before
running your program. Output from hAWK programs, which is generated by “print”
and “printf” statements, can be explicitly redirected to any file, but if no redirection
is provided then by default the output from the program is sent to $tempStdOut. The
file “$tempStdErr” will hold error messages if problems pop up while running a
program.
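A minimal sketch of both kinds of output follows. The path name is made up for the
example and would need to be replaced with a real full path name on your own disk; the
“>” form of redirection is standard AWK and is assumed to work the same way in hAWK.

```awk
BEGIN {
    # Hypothetical full path name: substitute a file of your own.
    longLinesFile = "HardDrive:Code Folder:Long Lines"
}

# Lines over 80 characters are redirected to the named file...
length($0) > 80 {
    print FNR ": " $0 > longLinesFile
    ++numLong
}

# ...while the summary, with no redirection, goes to standard output
# ($tempStdOut when run under hAWK).
END {
    total = numLong + 0      # force a numeric 0 if no long lines were seen
    print total " long line(s) found"
}
```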
Sometimes you’ll want to take input directly from the file $tempStdOut, without
bothering to use the above method of opening the file and bringing its window to
the front. It is perfectly OK to select $tempStdOut as the input file using the
“Select input file...” option under the “Take input from:” popup. The contents
of $tempStdOut just BEFORE the run will be used as the input, and input from
this “old” version of $tempStdOut will not be affected by anything you write
to $tempStdOut during the execution of your program. Actually, your old
$tempStdOut will be renamed to $tempOutAsInput just before the run, and
the file name your program receives will also be changed. This bit of subterfuge
is necessary since it is not possible to randomly read and write the same file
without things getting horribly confused.
------------------------
About the supplied programs
------------------------
For the most part, the programs you’ll find in the “hAWK programs” folder do useful
things (from the point of view of a C programmer), with just a few of them being of the
traditional “completely useless but illustrating some basic point” kind that are often
foisted on innocent customers by authors who have run out of steam before writing the
manual. There are nearly as many categories of supplied programs as there are
supplied programs, so the following list with brief descriptions is in simple
alphabetical order. The descriptions are brief here because each supplied program
contains a detailed explanation of what it does and how to use it, at the top.
“$RunClip” provides a handy way to run small programs as you explore hAWK,
without having to save them to disk first. You’ll find instructions below, and at
the top of the $RunClip file.
Unless otherwise mentioned, a program sends its output to the file $tempStdOut, and
you will be shown the contents of this file by the calling application at the end of the
run (if it is able to do so). Most programs will accept input from any source, but then
again most programs are especially useful with just one or two input sources.
$EnumSwitch, for example, expects a comma–separated list of enum constants as
input, normally provided by selecting the enum constants in a source code window and
taking input for $EnumSwitch from the selected front text. Running this program on a
batch of MFS selected files is possible, but wouldn’t produce very useful results. Once
you understand roughly what a program does, you should be able to judge what sorts of
input are appropriate for it.
The detailed instructions for running a program can be found at the top of the listing
for the program itself, and you should read through those before running a program
for the first time. For example, with $MFSLister you have to tell it what string
to search for, and this is done by setting a variable with the
“Set variables” button.
Programs which make essential use of the “progress” or “prompt” functions
should be run in “immediate” mode (see “Running hAWK programs”,
section “Concurrent and immediate modes”). To run a program in immediate
mode, hold down the <Shift> or <Option> key while selecting “hAWK” from
your application’s menus. Programs that should be run in immediate mode
are marked with (IMM) just after the program name below.
$AddLineNumbers: will add line numbers to a file. Takes input from one specific
file, and overwrites the contents of the file. Doesn’t number blank lines.
$Chain (IMM): allows you to run one or more small canned programs on your input,
the first program being executed using whatever input you specify, and
the following programs if any taking their input from stdout. You type
in the names of the programs to run in a dialog box, and they are executed
from left to right in the order you typed them. Effectively serves as a
“library” of small tasks. Illustrates using the hAWK() function to execute
a sequence of programs, repeatedly taking input from stdout, and the
“prompt” dialog box.
$Comments: extracts lines that contain C comments. Or rather, at least all
lines that contain comments.
$CompareFiles: prints differences between two versions of a file; for use with
the “MFS selected files” option. Has a couple of options, but should almost
always work fine with the defaults—see instructions if results seem suspicious.
Lengthy miscompares (over 100 lines) will cause it to bog down.
Demonstrates doing everything with functions rather than pattern–action blocks.
$DefineSwitch: generates a “switch” statement, with cases created from a list
of #defined constants. Normally takes input from the selection in your front
text window, output is shown selected in $tempStdOut for copying to your
working window.
$EchoFileNames: for use with the “MFS selected files” option, creates a list of
the file names that were selected.
$EchoFullPathNames: like $EchoFileNames, but generates full path names in
the general form “Disk:folder:folder1:...:folderN:filename”. Full path names
are required when redirecting input and output of hAWK programs.
$EnumSwitch: like $DefineSwitch, but generates the cases for the switch from
a comma–separated list of words, typically enum constants. Initializations
for any of the constants are ignored.
$ExtractExternRefs: list all C declarations encountered that begin with “extern”.
Fast and simple, but will stumble if it encounters “extern” as the first word
in a comment. (Exercise: steal the comment–skipping code from $XRef
to fix this little problem).
$FilesInOrderTest: discussed in the “Advanced topics” chapter way down below.
Demonstrates the technique of taking input from an arbitrary
list of files, the list itself being the sole input you pass to the program.
$FindSetVolEtc: an example of a small program knocked off in a minute to
solve a specific search problem. Searches for a list of specific terms, prints
the file name and line number where found, together with the context of the find.
$FrequencyWord: lists unique words in one or more documents, in declining order
of frequency. Demonstrates associative arrays and the sort command. A companion
to $WordFrequency.
$List_Potential_C_Locals : feed this the body of a C function, and it will return a
list of candidates for declaration as local variables within the function. Contains
a near-complete lexical analyser for C, and produces best results if the calling
application supports the “lookup” function.
$Lockout (IMM): a pathological excess. MUST be stopped with <Command><period>. Displays
a marquee–style message in Chicago or “giant” while you go to lunch. Trivial, but
the code itself is worth looking at (it can archive giant messages to files,
demonstrates two–dimensional arrays, implements severe abuse of the
progress() function). You can set the message before running, by changing
the “message” variable. Some other options available.
$LogDaemon: the only supplied program that must be run in concurrent mode
only. It waits around until you copy the (almost) word "logit", flashes the
menu bar to acknowledge, and then will append the NEXT bit of text you
copy to a specific file, together with a date stamp. Then another flash to
signal that it’s done. This program runs until you type <Command><period>.
See instructions before using, since you’ll need to change the name of the log file.
$LongestLines: will print out a list of the longest lines in one or more files. Use
“Set variables” to set how many lines to print, and how many spaces in a tab
before running. Properly converts tabs to spaces for calculating lengths,
illustrates several basic string functions.
$LookupTest: a demonstration of the lookup() built–in function.
$MFSLister: searches for a string or a regular expression (restricted to checking
one line at a time). Prints file name and line number where found, with optional
printing of the line containing the match.
$MFS_SuperLister: searches for a regular expression or plain text involving
variable white space, can match it even if it spans a variable number of lines (try
that with Grep!). Lists file name and line where found. It’s up to you to
provide the text or regular expression. The innards are much like $MFS_SuperReplace.
$MFS_SuperReplace: multi-file search and replace, searching for a regular
expression or a string of literal text that can span a variable number of lines.
Replacement text can replace or extend the pattern found. Alters the original files,
fully documents changes to stdout. Demonstrates using the hAWK() function
to selectively alter and execute a program, handling a variable number of
input lines at once in a “rolling buffer”.
$Print_MENU_Resource: given the result of Read Resource on a MENU resource,
this program prints a nicely–formatted version of the menu. A sample for doing
your own custom resource or data formatting and content verification, including
all of the necessary basic functions for doing so.
$Print_MPSR_1007: given the result of Read Resource on a “MPSR 1007” resource
(ie marks for a text file), prints out a nice version (see also $Print_MENU_Resource).
$printNF: trivial, prints the number of fields in each input line.
$ProgressTest, $PromptTest (IMM): demonstrate the progress() and prompt() functions respectively.
(The ultimate progress() example is $Lockout; for a nice little prompt()
example, see $YoungMath).
$RoughIndexer: if you dream of automatically generating an index, you can
start here.
$RunClip: for short, disposable programs to be run concurrently (note that $Type&Run
only runs in immediate mode). The calling application must support passing its
clipboard to hAWK (eg EnterAct). Create your program in the calling app, Copy it,
bring input to hAWK's attention (eg front text or a multi-file selection), then call
up hAWK and select and run $RunClip. Your copied program will be saved to the file
“$hAWKTempProgram”, and then executed using the built-in hAWK() function.
$SortTest: a test of the built–in sort() function, doing dictionary order. For
a real use, see $WordFrequency.
$SortTest_Nums: a sort() test on numbers. Uses rand() to generate the numbers.
$StubFunctions: given a list of C function prototypes, generates empty function
shells for the function definitions.
$TabsToSpaces: converts tabs to spaces in one or more documents, replacing each tab
by the appropriate number of spaces (anywhere from 1 to “spaces_in_tabs”),
consistent with the tab interpretation of THINK C et al. You set the
number of spaces in a tab with “Set variables”, and also whether to overwrite
the original file or make a copy with a new name. Demonstrates some
basic file–handling methods.
$Time: just prints out the time, using the TIME built–in variable, and the
time() function for comparison.
$TwoColumnsRight: given a list of numbers in two columns, right–justifies
the numbers in the columns. Demonstrates dynamically building
a printf() format string with variables and string concatenation.
$Type&Run (IMM): for short, disposable programs, use the dialog box presented by this
program to type in and run your one or two-liner. Since <Return> means “OK”
in the dialog, use <Command><Return> to advance to a new line. Illustrates using
the hAWK() function to save and execute a program.
$Uppercase: changes the first letter in each input field to upper case if
it is a lower case letter. Uses match(), sub(), substr().
$Whazzat: translates C declarations into English. Works best if the calling
application supports the “lookup” function so that special terms in your
declaration (typedefs, struct tags etc) can be diagnosed.
Illustrates using functions instead of pattern–action blocks, retrieving tokens
with string functions while parsing, reformatting long lines for output.
$WordFrequency: a “classic” use for AWK - print sorted list of unique words
in the input, together with the number of times each word is used.
$XRef: generates file and line number listing for your choice of terms in C source
code. Illustrates the hAWK() function, sorting. The calling application must
support the “lookup” function (see “Built–in string and file functions” in the
“Actions” chapter).
$XRef_Full: like $XRef, but doesn’t skip comments and strings.
$YoungMath (IMM): demonstrates the prompt() function while urging you to add
numbers.
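To give a feel for the size of these programs, the idea behind $WordFrequency can be
sketched in a few lines of plain AWK. This is not the supplied program: it leaves out the
sorting step (hAWK’s built–in sort function is not used here), it does not fold case, and
it treats a “word” as a run of letters, which is an assumption made for the sketch.

```awk
# Count how often each word occurs.
{
    line = $0
    gsub(/[^A-Za-z]+/, " ", line)       # anything that is not a letter becomes a blank
    n = split(line, words, " ")         # split on runs of blanks
    for (k = 1; k <= n; ++k)
        ++count[words[k]]
}

# Print each unique word with its count; the order of a for-in loop is unspecified.
END {
    for (w in count)
        print count[w] "\t" w
}
```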
---------------------
hAWK program structure
---------------------
From start to finish
A typical hAWK program run progresses as follows:
1 From the hAWK setup dialog, specify the main program to be run, add any library files
that go with it (optional), specify initial values for variables (optional), and build
a list of input text files for the program to work on (optional, but almost always
included).
2 Collect the main program and libraries together into one big program. Reduce it
to a form more suitable for interpretation. Assign initial values to variables if you
have provided any. The list of input files is made available to the program, in the
array ARGV[] of file names.
3 Execute the program: by default, hAWK automatically reads the text from the input
files into memory, one “record” at a time (the default is that a line is a record). If
a record matches one of your specified patterns, then action is taken. Statements may
optionally be executed before and after the input is dealt with. Schematically, a
generic hAWK program looks like
#An abstract hAWK program:
BEGIN {beginning statements}
pattern1 {action statements for pattern1}
...
patternN {action statements for patternN}
END {ending statements}
(--supporting function definitions--)
and the corresponding program execution proceeds as follows:
• execute any supplied BEGIN statements
• read the input files into memory, one record at a time; for each record
check all patterns; if the pattern is TRUE for the current input record,
execute the associated action statements; in C this would look like:
while (get_another_input_record())
{
for (pattern1 to patternN)
{
if (pattern is TRUE)
{
action statements for the pattern
}
}
}
• execute any END statements
4 Unless otherwise specified by redirection, all output via “print” or “printf”
statements goes to the default standard output file, called “$tempStdOut”.
5 Comments in the source code, which begin with a “#” and continue to the end
of the line, are ignored.
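A concrete instance of the abstract program above might look like this. The pattern
/TODO/ and the variable name are arbitrary choices for the sketch.

```awk
# Runs once, before any input is read.
BEGIN { print "Lines mentioning TODO:" }

# Pattern-action block: runs for each record that contains "TODO".
/TODO/ {
    print FNR ": " $0
    ++numFound
}

# Runs once, after all input has been read.
END { print numFound + 0, "line(s) found" }
```

The “+ 0” in the END block makes the count print as 0 rather than as an empty string
when no record matched.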
BEGIN, END, and pattern–action blocks may occur in any order in the source for
the program. Programs may also contain function definitions, which are
introduced by the “function” keyword, and take the general form:
"function" funcName(parameter1, parameter2,...local variables)
{
statements making up the function body
}
If a function is generally useful, it may be placed in a library file to save duplication.
You’ll find little emphasis on libraries, since it costs very little to duplicate a function
right in the main program, and this makes the programs easier to read.
Library files should be reserved solely for function definitions to avoid confusion.
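As a small illustration of the general form of a function, the sketch below defines one
with two parameters and two local variables. The name repeat() is invented for the example;
the extra spaces before the locals are only a common AWK convention for setting them
apart from the real parameters, not something the language requires.

```awk
# Returns the string s repeated n times. "i" and "result" are local variables:
# they are listed after the real parameters and are never passed by the caller.
function repeat(s, n,    i, result)
{
    result = ""
    for (i = 1; i <= n; ++i)
        result = result s
    return result
}

# Underline each input line with dashes of the same length.
{
    print $0
    print repeat("-", length($0))
}
```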
hAWK automatically reads in your input files one “record” at a time, also breaking each
record into “fields”. The current record is in the built-in variable $0, and the fields
are in $1, $2, …$NF (where NF is another built-in variable giving the number
of fields in the current record). By default a record is the same as a line and fields are
separated by blanks or tabs, so you can think of the default as reading your input one
line at a time into $0 and making the individual words available in $1, $2 etc (but note
that all punctuation except blanks, tabs, and returns will still be present in the fields).
For example, if the current line in an input file reads
"for (i = 0; i<7; ++i)"
then that will be the content of $0, and the fields will be
$1 = "for", $2 = "(i", $3 = "=", $4 = "0;", $5 = "i<7;", $6 = "++i)", with
NF, the current number of fields, set to 6.
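To see how any particular line is divided up, a short program along these lines will
print the record and then each of its fields on a separate line. It relies only on $0, NF,
and a field reference with a variable index ($k), all of which are standard AWK.

```awk
{
    print "$0 = \"" $0 "\""
    for (k = 1; k <= NF; ++k)           # k runs over the field numbers 1..NF
        print "$" k " = \"" $k "\""
    print "NF = " NF
}
```

Running it with input taken from a selected line of source code is a quick way to check
where punctuation ends up.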
Here’s a real program to give you a taste ("$EnumSwitch", in the “hAWK
programs” folder):
#$EnumSwitch
#Select a bunch of enums, and run Hawk on the front selection
# -optionally select the entire enum body from '{' to '}' with Balance
#Leave "Show std out" and "Select all of stdout" checked
{ gsub(/=[^,]*/, " ")#remove initializations for the enum constants
gsub(/=(.)*$/, " ")#ditto
gsub(/[,{};]/, " ")#remove remaining punctuation, leaving just the enums
for ( k = 1; k <= NF; k++)#build an array containing the enum names
case[++i] = $k
}
END { print "switch (??)"
print "\t{"
for (k = 1; k <= i; ++k)
{
print "case " case[k] ":"
print "\t"
print "break;"
}
print "default:"
print "\t"
print "break;"
print "\t}"
}#end program
Given a list of names from an enum definition, such as
"{left, right, up, down, twilightZone = 999}" this program generates
switch (??)
{
case left:
break;
...etc…
case twilightZone:
break;
default:
break;
}
To run this program: first select a list of comma-separated names (typically use the
contents of an enum definition); select "hAWK" from the calling application’s menu;
select "$EnumSwitch" from the "Main program" popup; (note the "Take input from:"
popup will then read "Front text selection"); and click the Run button. The generated
"switch" statement will appear in a window called "$tempStdOut", ready to be copied
and pasted into your working window.
Grouping and breaking lines
The rules for organizing and grouping your program lines differ a bit from the rules
for C; a <Return> (also called newline) can stand for a semicolon after most hAWK
statements, the price of this being that lines cannot be arbitrarily broken as in C,
to avoid confusion between ending a statement and merely continuing it to the next
line. The rules below are listed in rough order of their impact on whatever C
formatting habits you have.
• When in doubt, use a backslash '\' immediately followed by a <Return> to continue a
long line, as with preprocessor macros and strings in C. For example:
x = y + (z - 1) + SomeFunction(param1, param2\
, param3, param4) + w;
• Long conditional tests can be broken to the next line immediately after any logical
operator (&&, ||, !). Eg:
if ( lineNumber >= maxLines &&
$0 != "")
• A long line may be broken after a comma, eg
x = y + (z - 1) + SomeFunction(param1, param2,
param3, param4) + w;
• The '{' that begins an action should be placed on the same line as the end of the
pattern for it, eg
FNR == 1 || FNR == 2 ||
FNR == 3 { #Note '{' is on same line as end of pattern
print
}
• A comment in hAWK begins with a '#' and continues to the end of the line. A comment
can be placed at the end of any line except a line that is continued with a backslash and
<Return>.
• Group multiple statements together with '{' and '}', as in C, eg
if ($0 ~ /TEST/)
{
print "TEST on line", FNR
++numTests
}
• When in doubt, terminate a single statement with a semicolon. Multiple statements
may be placed on one line if separated by semicolons, eg
if (a >= b) print "a is bigger"; else print "b is bigger";
or
do ++x; while (x < maxForX);
• In if-else and do-while constructs, the “else” and “while” keywords should either
be placed on a new line or preceded by a semicolon or '}'. In other words, clearly
signal the end of the “if” or “do” part, so that the “else” or “while” doesn’t pop
up by surprise:
these are OK;
if (a > b) ++b; else ++a
if (a > b) ++b
else ++a
do {--x; print x} while (x > 0)
these are not;
if (a > b) ++b else ++a
do ++x while (x < maxForX);
----------------------
The Command line and ARGV[]
----------------------
To run a hAWK program, you must tell hAWK which program to run, and what files
to use for input data, with other optional details. Classically, these file names etc
are passed to AWK in an array of pointers called argv; hAWK works the same way,
but these names are generated for you when you set up a hAWK run using the
setup dialog, saving you the work of typing them all in each time.
All you really need to know about the command line is that, at the time a program
is run, the names of the input files it is being asked to deal with are contained in
the array named ARGV, and the number of input files equals ARGC-1 (where
ARGV is a built-in array name, and ARGC is a built–in variable name). Input file
names are full path names, so typical contents are
ARGV[1] = "Disk:folder:...:folder:First_Input_file"
...
ARGV[ARGC-1] = "Disk:folder:...:folder:Last_Input_file".
Running the sample program “$EchoFullPathNames” on some input files will provide
you with a specific example—why not give it a try? Use your calling application to
select some files for multi–file operations (“searching”), then run
$EchoFullPathNames and see what results. This is the complete program:
BEGIN {
for (i = 1; i < ARGC; ++i)#note ARGV[0] is just "hAWK"
print ARGV[i]
}
Details follow on the command line generated by hAWK’s setup dialog, in case you
are interested in modifying hAWK. You may also find this background helpful if you
use the hAWK() function, which executes another program from within a program
and requires an explicit command line as its argument (see ch. Q, “The hAWK function”).
The command line passed to hAWK from the setup dialog takes the general form
hAWK -fProgramName {-fLibraryName} {-vVariable=value} --
{InputFileName}
where the {} brackets indicate that an item may be repeated or omitted. For example, if
running a program "$BigSort" with supporting library "Sort_Routines", with the files
to be sorted being "Text1" and "Text2" then the command line passed to hAWK by the
setup dialog will be something like
hAWK -f$BigSort -fSort_Routines -- HardDrive:Code Folder:Sub Folder:Text1
HardDrive:Code Folder:Sub Folder:Text2
The "-f", "-v", and "--" are little markers that hAWK uses to tell what's what.
"-f" means a program file, "-v" means a variable assignment, and "--" means
that only input files (if anything) follow this marker.
By the time the command line becomes available to you within your hAWK program,
the array "argv" is a hAWK array of strings called "ARGV" that contains only "hAWK"
in ARGV[0] followed by the names of the input files in ARGV[1], ARGV[2] etc,
and ARGC is set to the number of elements in the ARGV array, namely the number of
input files plus one. The last input file name is ARGV[ARGC-1].
Normally, the input file names are the only things on the command line of interest
that you don't already have access to. You'll have access to the variables anyway,
and one can't help thinking that it would be an odd program indeed that needed to know
its own "ProgName".
Here's a hAWK program that prints a complete list of the input file names passed
to it ($EchoFullPathNames again):
BEGIN {
for (i = 1; i < ARGC; ++i)#note ARGV[0] is just "hAWK"
print ARGV[i]
}
If you included this block in $BigSort above, then the output would be something like
HardDrive:Code Folder:Sub Folder:Text1
HardDrive:Code Folder:Sub Folder:Text2
—as you can see, you're getting the full path names of the file, not just the file names.
Here's a version that prints just the file names proper:
BEGIN {
for (i = 1; i < ARGC; ++i)
{
n = split(ARGV[i], names, ":")
print names[n]
}
}
for which the output would for example be
Text1
Text2
The important thing to note here is that hAWK deals with full path names for files,
especially relevant if you are redirecting input or output (more on this later).
When you assign values to variables using the "Set variables" button in the setup
dialog, the result is the same as if you assigned the value in the BEGIN block of
your program. However, you should NOT use quotes if you are assigning a text
string to a variable using "Set variables"—for example, the variable assignment
find=text to find
within the "Set variables" dialog is equivalent to the statement
BEGIN {find = "text to find"}
within your actual program. This is meant to be a convenience, but is perhaps a
nuisance, in that any spaces between the '=' and the value are significant:
find =text
is not the same as
find = text
—that space between the '=' and the 't' of "text" will be included in the string for "find".
The "Set variables" button can be used to set the value of any hAWK variable,
whether your own or a predefined (built-in) variable, and it is easier to change
a variable this way than to edit the program itself. Up to 10 variables can be set
with "Set variables", and your variable settings will be saved for the next run
if you click the "Save settings" button in the setup dialog. For an illustration and
more details, see the “Running hAWK programs” chapter.
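One pattern this suggests is to give a variable a fallback value in the BEGIN block, used
only when nothing was preset. The sketch below assumes that presets are already in place
when BEGIN runs, as the "-v" form of the command line suggests; the variable name and the
fallback text are invented for the example.

```awk
BEGIN {
    # A variable that was never set compares equal to the empty string.
    if (find == "")
        find = "TODO"       # fallback used when no preset was supplied
}

# Report each line that contains the text held in "find".
index($0, find) > 0 { print FNR ": " $0 }
```

The program then runs sensibly the first time, before any settings have been saved,
and a later preset simply overrides the fallback.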
----------------
Variables and constants
----------------
Variable names and types
hAWK has many built–in variables, and you can use your own. A variable of your
own devising springs into existence when you first use it, with no need to declare
it (excepting perhaps local variables for functions, which need to be not so much
declared as “mentioned”—see “Local variables in functions” below).
Variable names in hAWK take the same form as C names: a letter or underscore
followed by any number of letters, underscores, and digits.
hAWK has both scalar variables and one–dimensional arrays. The value of a variable or
array element may be a (floating–point) number OR a string, and the specific type at
any time depends on how you use the variable. While numeric values in hAWK are
nominally floating–point, if you consistently use a variable as an integer you will
get predictable results. For example,
for (i = 0; i <= 1; ++i)
print i
will print two values, 0 and 1, guaranteed.
Uninitialized variables have the numeric value 0 and the string value "" (the null, or
empty, string). Note this differs from a variable that has been explicitly initialized
to zero, for in this case while the numeric value will be zero the string value
will be "0".
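A BEGIN-only program is a quick way to see the difference for yourself. The sketch below uses two made-up variable names, neverSet and zero, and the output shown in the comments is what standard AWK behavior predicts; it has not been checked under every calling application.

```awk
BEGIN {
    # neverSet is a hypothetical variable that is never assigned anywhere
    print "number:", neverSet + 0       # number: 0
    print "string: [" neverSet "]"      # string: []

    zero = 0                            # explicitly initialized to zero
    print "string: [" zero "]"          # string: [0]
}
```

The square brackets are only there to make the empty string visible in the output.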
Constants
Constants can be integers, floating–point numbers, or strings. For example,
x = "A string of text";
y = 7;
z = .31415926E1;
pat = "[_A-Za-z][_A-Za-z0-9]*"; (a string to be interpreted as a regular
expression - it matches a hAWK variable name).
Record and field variables
After the BEGIN block(s) of a program have been executed, a hAWK program proceeds
to automatically retrieve records from your input files one at a time to the built–in
variable $0, and individual fields in the current record can be accessed with the
built–in variables $1, the first field, $2 etc up to $NF, the last field, where NF is
a built–in that records the current number of fields. Records are separated according
to the string contained in the built–in record–separator variable, RS. By default
this contains just a return, ie RS = "\n", so a record is the same as a line. You can
change the value of RS, and setting RS to "" (the null string) will cause empty lines
to be treated as the record separator. Note that the record separator itself is
trimmed from the record.
Similarly, fields are separated in accordance with the value of the field–separator
variable, FS. By default the field separator is a regular expression standing for
“one or more blanks or tabs”, and as a nicety if you use the default value of FS then
any leading blanks or tabs will be trimmed away from the first field, $1.
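As a small illustration of changing the record separator, the sketch below treats each group of lines set off by empty lines as one record. The input layout (paragraphs or address blocks separated by empty lines) is an assumption made for the example.

```awk
# Assumed input: groups of lines separated by one or more empty lines.
BEGIN { RS = "" }                    # empty lines now separate records
{
    # With RS set to the null string a newline also separates fields,
    # so NF counts every word in the multi-line record.
    print "record " NR ": " NF " fields, first field is " $1
}
```

Leaving FS at its default keeps the usual trimming of leading blanks and tabs from $1.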
References to non-existent fields (fields after $NF) produce the null string.
However, assigning to a non-existent field (e.g., $(NF+2) = 5 ) will increase the
value of NF , create any intervening fields with the null string as their value, and
cause the value of $0 to be recomputed, with the fields being separated by the value of
OFS, the output field separator. A negative field number is an error.
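The effect of assigning past the last field can be seen with a sketch like the following. The OFS value "-" and the sample input line "a b c" are chosen only to make the rebuilt record easy to read.

```awk
BEGIN { OFS = "-" }
{
    print NF ": " $0          # for the input line "a b c" this prints  3: a b c
    $(NF + 2) = "end"         # assign two fields past the current last field
    print NF ": " $0          # now prints  5: a-b-c--end
}
```

The doubled hyphen in the second line of output marks the intervening fourth field, which was created with the null string as its value.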
Many functions in hAWK allow you to optionally specify a string for them to work on,
and if you don’t specify a string then they use $0, the current input record, by default.
For example,
print "some text"
will do just that—print the string "some text" to the standard output, whereas
print
all by itself, will print the contents of $0 to stdout, and thus it has the same effect as
print $0
Note that “print” tags on the contents of ORS, by default a return, to its output, so in the
default case the return that was trimmed away when retrieving the current input
record is added back. Thus, the hAWK program that consists of the one line
{print}
will echo all of its input to stdout (the file $tempStdOut) without change, though a flurry
of activity involving returns takes place behind the scenes.
This little program prints the individual fields of each input record to individual lines:
{for (i = 1; i <= NF; ++i)
print $i
}
—note that the field specifier can be a variable as in “$i”, and doesn’t have to be
a constant.
Built–in variables
hAWK's built-in variables are:
ARGC the number of input files plus one
ARGV array of command line arguments. The array is indexed from 0 to ARGC - 1,
the input file names being ARGV[1] through ARGV[ARGC-1].
Dynamically changing the contents of ARGV can control the files used for data.
FILENAME the name of the current input file. If no files are specified on the command line,
the value of FILENAME is "-". A hAWK program may do all of its work in a BEGIN
block, with no need for input (generating a list of random numbers for example).
FNR the input record number in the current input file. Reset to 1 when starting a new
input file. Hence the pattern “FNR == 1” detects the start of each file.
FS the input field separator, a blank by default. If the default FS is used then
leading blanks and tabs are trimmed from $1.
IGNORECASE controls the case-sensitivity of all regular expression operations.
If IGNORECASE has a non-zero value, then pattern matching in
rules, field splitting with FS , regular expression matching with
~ and !~ , and the gsub() , index() , match() , split() ,
and sub() pre-defined functions will all ignore case when doing
regular expression operations. Thus, if IGNORECASE is not equal to
zero, /aB/ matches all of the strings "ab", "aB",
"Ab", and "AB". The initial value of IGNORECASE is zero,
so all regular expression operations are normally case-sensitive.
NF the number of fields in the current input record.
NR the total number of input records in all input files seen so far.
OFMT the output format for numbers, %.6g by default.
OFS the output field separator, a blank by default.
ORS the output record separator, by default a newline.
RS the input record separator, by default a newline. RS is exceptional
in that only the first character of its string value is used for
separating records. If RS is set to the null string, then records are
separated by blank lines. When RS is set to the null string, then
the newline character always acts as a field separator, in addition
to whatever value FS may have.
RSTART the index of the first character matched by match(); 0 if no match.
RLENGTH the length of the string matched by match(); -1 if no match.
SUBSEP the character used to separate multiple subscripts in array
elements, by default "\034", some kinda up arrow very rare in text.
(and three added for the Macintosh version)
RUNERR short for "run error", a file name that you can use to print your own error
messages to, as in print "Error during run" > RUNERR. Default name
is $tempRunErr, and you'll find the file in the same folder as $tempStdOut.
STDPATH path name that can be prefixed to any file name you wish to be written to the
same folder as stdout ($tempStdOut). Typically looks like
"Disk:folder1...:THINK C folder:" and typical use looks like
outname = "MyOutFile"
fullOutName = STDPATH outname;
print "something" > fullOutName;
TIME at start of run, eg "Sunday, October 13, 1991 07:58 AM"
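Several of these built-ins are often used together. The sketch below, one possible arrangement among many, uses FNR, NR and FILENAME to print the number of records in each input file followed by a grand total. The variable names lastFile and lastCount are invented for the example, and a completely empty input file will not be reported since it never supplies a record.

```awk
# Print the record count of each input file, then the grand total.
FNR == 1 && NR > 1 { print lastFile ": " lastCount }   # a new file has begun
{ lastFile = FILENAME; lastCount = FNR }
END {
    if (NR > 0)
        print lastFile ": " lastCount                  # the final file
    print "all files: " NR
}
```

Since FILENAME holds a full path name, the names printed will be full paths unless you split them on ":" as shown earlier.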
Local variables in functions
Function definitions in hAWK resemble those of C a bit, but local variables require
an odd syntax. They must be listed in the parameters of the function, after the real
parameters, in order to be treated as local. All other variables in hAWK have global
scope. For example, in
function SumArray(arr, idx, sum)
{
    for (idx in arr)
        sum += arr[idx];
    return sum
}
the only real parameter is the array name “arr”. This function sums up the contents of
the array and returns the sum, used as in “sum = SumArray(x);” where x is an
array containing numbers. The variables “idx” and “sum” look like orphans there
in the parameters, but this is just the hAWK way of declaring local variables. (The
name “idx” is used rather than “index” because index() is a built–in function, so
its name should not be reused for a variable.) Neither idx nor sum can be affected by
any statements outside the SumArray function (that is, they are local in scope), and
as a bonus hAWK initializes even local variables to 0 each time the function is
called. Functions are described in more detail a little later in the chapter
“User-defined functions”.
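To see that a local variable really is separate from a global one of the same name, try a sketch along these lines. The function Double and the variable tmp are hypothetical names; the extra spaces before tmp in the parameter list are just a common convention for setting locals apart from real parameters, not something the language requires.

```awk
function Double(n,    tmp)      # n is the real parameter, tmp is a local
{
    tmp = n * 2
    return tmp
}

BEGIN {
    tmp = "global value"
    print Double(21)            # 42
    print tmp                   # global value (untouched by the function)
}
```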
Setting variables on the command line
When variables are set using the “Set variables” option in the setup dialog, no quotes
should be used around strings, and no space should be put between the equals sign and
the string or number unless you want it to be included in the value. For example, the
equivalent of
BEGIN {find = "some text to find"; first = 7;}
in the “Set variables” dialog would be
find =some text to find
first =7
(the space before the equals sign is optional).
Conversion between numbers and strings
Conversion of a variable’s value between number and string is automatic in hAWK when
circumstances call for it, and can be forced by you as well. When an operator is strictly
numeric, the value of its operands will be forced to numbers if necessary, and similarly
if an operator expects to deal strictly with strings then values will be forced to strings.
For example, in
a = "102";
b = a + 1;
“a” starts out as a string, but the “+” operator deals strictly with numbers, so “a”
is converted to the number 102.0 on the second line.
And in
a = 27;
b = "trombones";
c = a b; #there is a space between a and b
we see the invisible “concatenation” operator at work. Two variables or constants separated
by just a space are treated as strings by hAWK and concatenated together. So “a” is converted
to a string on the third line, and “c” ends up holding the string "27trombones".
Some operators (all of the comparison operators == <= >= etc for example) can accept
either strings or numbers. When this is the case, the rule is that the operation proceeds
numerically if both operands are currently valid numbers, but proceeds as a string operation
otherwise.
You can force a variable to be treated as a string by concatenating the null string to it. For
example, no matter what the values of a and b are, the comparison
a "" == b
will proceed as a string comparison.
And you can force a variable to be treated as a number by adding 0 to it, as in
a + 0 == b + 0
but note in this case that both operands should be forced to numeric type.
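The two forcing idioms can give opposite answers for the same pair of values, which is a good reason to force the type whenever it matters. A minimal sketch, with a and b chosen for the purpose:

```awk
BEGIN {
    a = "10"; b = "9"

    # Concatenating the null string forces a string comparison:
    # "10" sorts before "9" because "1" comes before "9".
    if (a "" < b "")
        print "as strings, 10 comes before 9"

    # Adding 0 forces a numeric comparison.
    if (a + 0 > b + 0)
        print "as numbers, 10 is greater than 9"
}
```

Both messages are printed.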
Arrays
Arrays are subscripted with an expression between square brackets, as in arr[expr].
Array values can be numbers or strings, but the index is always interpreted as
a string. For example, when you write
arr[1]
the 1 is converted to the string "1" for use as the array index, so arr[1] is
the same as arr["1"]. This sort of array is called “associative” since it can
associate one string of text with any other, eg
arr["John Henry"] = "was a log-drivin man"
If the index expression is an expression list ( expr1, expr2, expr3,... ) then the array
subscript is a string consisting of the concatenation of the (string) value of each
expression, separated by the value of the SUBSEP variable, which is by default
“\034” (decimal 28, an up arrow). This facility is used to simulate
multiply–dimensioned arrays. For example:
i = "A" ; j = "B" ;k = "C"
x[i, j, k] = "hello, world"
assigns the string "hello, world" to the element of the array x
which is indexed by the string "A\034B\034C".
The special operator "in" may be used in an "if" statement to see if an array has
an index consisting of a particular value:
if (val in array)
print array[val]
If the array has multiple subscripts i, j, k, use
if ((i, j, k) in array)
instead. The alternative
if (array[val] != "")
actually creates the element array[val] if it does not exist, so using “in”
is usually better.
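The side effect of referring directly to a missing element can be demonstrated by counting the elements before and after. In this sketch the array seen and the helper function Count are hypothetical names introduced for the illustration.

```awk
# Count the elements of an array; k and n are local variables.
function Count(arr,    k, n)
{
    for (k in arr)
        ++n
    return n + 0
}

BEGIN {
    seen["apple"] = 1

    if ("pear" in seen)
        print "not reached"
    print "after the in test:", Count(seen)            # after the in test: 1

    if (seen["pear"] != "")
        print "not reached"
    print "after the direct reference:", Count(seen)   # after the direct reference: 2
}
```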
The "in" construct may also be used in a for loop to iterate over all the elements of an
array:
for (i in arr)
delete arr[i] # or print arr[i] , or print i, arr[i]
An element may be deleted from an array using the delete statement. New elements should
not be added to an array while looping over it with the "in" for-loop, since hAWK isn’t
quite smart enough to handle that very well.
Behind the scenes, indexes for an array are stored in a hash table. Retrieval of an array
element takes constant time up to a moderate array size (~1000), but as array size
increases retrieval time will increase as a linear function of the size.
Some array examples:
for (i = 1; i <= 100; ++i)
x[i] = i;
This does what you would expect, creating x[1] = 1, ... x[100] = 100. Note, however, that
while i is treated as an integer in the for loop, it is converted to the string representation
for that number when used as the index for x.
for (i = 1; i <= NF; ++i)
wordCounter[$i] += 1;
Here we see the real power of hAWK’s associative arrays. $i is a string containing a field
on the current input line, and this string is used as an index into the wordCounter array.
If there is no element in the array yet for the index, a new element is created (and
initialized to 0/the null string, as for regular variables). The array element itself holds
just a count of how many times the string has been seen. Obviously, you can’t access these
array elements by incrementing a numeric index—here’s where “in” comes in:
for (word in wordCounter)
print word, "was seen", wordCounter[word], "times."
prints out the words used to index wordCounter, together with the word counts, a sample
line being
parsimonious was seen 1 times.
The one drawback of this simple example is that the words will be printed in a rather
arbitrary order (internally, the entries in a hash table are being accessed). However,
even this shortcoming can be overcome. The sample program “$WordFrequency”
shows how to sort an array such as wordCounter into dictionary order on the index.
while (getline x > 0)
lines[++n] = x;
The “getline x” will retrieve records from your current input file to the variable x,
from the current position to the end of the file. Each record is saved away as an element
in the array “lines”. Here the index is a number (technically the string for the
number) and the element is a string —the reverse of the last example.
times[3,7] = 21;
The actual index is "3" "\034" "7" concatenated together. A multi-dimensional
array can be run through in the same way as in C:
for (i = 1; i <= iMax; ++i)
{
for (j = 1; j <= jMax; ++j)
{
print times[i,j] #or whatever
}
} # note "for (k in times) print times[k]" could also be used.
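If you do loop over a multi-dimensional array with “in”, each index arrives as a single string with the subscripts joined by SUBSEP. One way to recover the individual subscripts, sketched here with a made-up array name part, is to split the index on SUBSEP:

```awk
BEGIN {
    times[3,7] = 21
    for (k in times) {
        split(k, part, SUBSEP)              # part[1] = "3", part[2] = "7"
        print part[1], part[2], times[k]    # 3 7 21
    }
}
```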
-------
Patterns
-------
Patterns and actions
At the top level, a hAWK program consists of patterns and actions, of the general form
pattern { action }
When a pattern evaluates to true (non–zero), the corresponding action is taken.
Patterns resemble the conditions found in a C if-statement, but several kinds of
patterns, notably BEGIN, END and patterns using the matching operator '~', are not
found in C. As described earlier, hAWK will automatically read in your input one
record at a time to the variable $0, and each pattern is evaluated in turn; if the pattern
is true for the current input, then the action statements are executed.
A missing pattern evaluates to true, so action statements with no preceding pattern
are executed for every input record. A missing action is equivalent to
{ print }
which prints the input record to stdout. It’s equivalent to {print $0}, by the way.
Here’s a sample pattern-action block that is often useful:
FNR == 1 { z = split(FILENAME, names, ":") }
FNR stands for "file number of records", reset to 1 at the beginning of each input file.
FILENAME is a variable holding the full path name of the current input file. The split on
':' splits FILENAME into an array, treating the ':' as the element separator. Often, one
wants just the file name proper without the disk and folders, and this is given by
names[z]. For example, if FILENAME = "Disk:folder:thefile" then the split produces
names[1] = "Disk", names[2] = "folder", and names[3] = "thefile", with "z" being
set to 3. The statement "print names[z], FNR" will print the current input file name
and current line number to stdout.
The “Summary of patterns” section at the end of this chapter contains a small program
that will let you try out patterns as they occur to you. Or you could use $RunClip.
BEGIN and END
BEGIN and END are two special kinds of patterns which are not tested against the input.
The action parts of all BEGIN patterns are merged as if all the statements had been
written in a single BEGIN block. They are executed before any of the input is read.
Similarly, all the END blocks are merged, and executed when all the input is exhausted
(or when an exit statement is executed). BEGIN and END patterns cannot be combined
with other patterns in pattern expressions. BEGIN and END patterns cannot have
missing action parts.
BEGIN {FS = ",[ \t]*|[ \t]+"}
sets the field separator to either a comma followed by optional blanks and tabs or
one or more blanks and tabs—a common field separator in a real database.
END blocks are often used to finish up after all the input has been seen, as in this
little program:
{out[++n] = $0}
END {for (i = n; i >= 1; --i) print out[i]}
which accumulates all input records in the array “out”, and then at the end
prints out the records in reverse order.
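BEGIN and END frequently appear in the same program, one to set things up and the other to report. The sketch below reuses the field separator shown above and assumes, purely for illustration, that the second field of every record holds a number to be added up.

```awk
BEGIN { FS = ",[ \t]*|[ \t]+" }      # comma or blank separated fields
{ total += $2 }                      # add up the second field of every record
END { print NR " records, total " (total + 0) }
```

The “+ 0” guarantees that a 0 rather than an empty string is printed if there was no input at all.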
Expressions as patterns
Simply put, an expression is any sensible combination of variables, operators, and
(rarely) function calls. When an expression used as a pattern evaluates to a non–zero or
non–null result, the action following it will be carried out.
The most common sort of expression used as a pattern is the comparison, involving the
operators ==, <=, >=, >, <, and !=. These can be used with any hAWK variable or
calculated result, and it is a refreshing improvement over C to be able to test
two strings for equality with the simple “a == b” instead of “!strcmp(a,b)”.
Comparison patterns quite often involve tests on the current input, such as
“$1/$2 >= 100”, “$3 == "Wilhelmina"”, “$0 != ""”, the last testing that
the current input line is not empty. Built–in variables are also popular, as in
the “FNR == 1” example a few paragraphs above, which detects the start of an
input file. Your own variables can of course appear, as in
$1 != lastFieldOne { print "New field one is", $1
lastFieldOne = $1
}
which prints the contents of the first field on the input line whenever it changes.
In a comparison, if both sides are numeric then the comparison is made numerically,
but if one side evaluates to a string then the comparison is done in terms of strings,
with the other side first being converted if necessary to a string.
String-matching patterns
The matching operator, denoted by a tilde (~), allows you to detect whether one string
contains another string, though technically that other string is treated as a “regular
expression”. More on regular expressions in just a minute, but for now you can form a
regular expression to look for from a string of characters by putting a forward slash
before and after them. For example, if you wish to determine if the current input line
contains the string "exception", then the pattern
$0 ~ /exception/
will do it. Note that it could match the line
"while this is not an exceptional case, there are other"
that is, the match does not have to be an entire word.
By default if you omit the string for the matching operator to check against, and further
omit even the matching operator, leaving just the regular expression enclosed in slashes,
then the match will be done against the current input line $0. In other words,
/regular expression/ {action}
is the same as
$0 ~ /regular expression/ {action}
—and since even the action is optional (recall the default is to print $0), about the shortest
hAWK program you can write is
/a/ #equivalent to $0 ~ /a/ { print $0 }
which will print any input line containing an “a” to stdout.
To match punctuation explicitly in your expression you should precede it with a
backslash, eg /question\?/, /the end of the sentence\./, /array\[index\]/.
You can use quotes instead of the forward slashes to surround the text of your regular
expression with the same results. In this case, though, the matching operator must
explicitly appear. Eg
$0 ~ "Mars" {print "red planet detected on input line", FNR}
And to match punctuation explicitly inside the quotes, you should precede the punctuation
with two (that’s right, two) backslashes. For example, to match "the end." use
string ~ "the end\\."
Using forward slashes instead of quotes around your regular expression has three small
advantages: matching against $0 doesn’t need to be fully written out, only single escapes
are necessary to match punctuation, and after a while the forward slashes will stand out
as you read your programs, signalling a matcher.
The negation of the matching operator, “!~”, allows you to determine if a string does
not contain some regular expression, as in
$2 !~ /A/ {print "Error, second field does not contain the letter A"}
and any points mentioned above for ~ apply to !~.
Regular expressions
Regular expressions aren’t as hard to use as a first impression suggests, and if you try
out a dozen you’ll be hooked, guaranteed. In regular expressions certain characters
have special “powers” that allow you to search for entire related groups of strings
with a single specifying string. Consider that an ordinary “find” command will not let
you completely match the following variations of a string: plurals; possessives;
variable blanks, tabs and especially returns between the words of a string; one or more
alternate words in the string; the complete word that contains some special substring;
two or more complete strings at once (one or the other).
A regular expression is nothing more than a string of text with optional special
“metacharacters”, and in most cases the string to be used can result from
the evaluation of a variable, or the concatenation of several strings or variables.
This means you can build the regular expressions for your program during the
execution of your program, modifying them on the fly to suit changing circumstances.
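As a sketch of building a regular expression at run time, the program below assembles a pattern from a list of alternatives held in a variable. The names colors and pat are hypothetical, and the list could just as well be supplied through “Set variables” instead of being assigned in the BEGIN block.

```awk
BEGIN {
    # colors is a made-up list of alternatives separated by |
    colors = "red|green|blue"
    # Match a listed word at the start of the line, followed by a blank or tab.
    pat = "^(" colors ")[ \t]"
}
$0 ~ pat { print FNR ": " $0 }
```

Because the pattern is held in a string rather than between forward slashes, the matching operator must appear explicitly.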
Parts of a regular expression can be grouped (with ordinary parentheses), and later in
the regular expression or in a replacement string can be referred to by the group “tags”
\1, \2, ... \9 where \1 refers to the group started by the first left parenthesis, \2 to
the second, etc. These allow you to match a small pattern within the context of a larger
one, detect duplicate expressions, change the order of the groups and so on. Note that
parentheses have the highest precedence of all regular expression “operators”, so
they serve two purposes; changing the order in which the metacharacters apply, and
marking the boundaries of a group, for later reference via \1..\9. More on this in a bit.
Regular expressions are built from ordinary characters, the escape sequences
\t \n \b \B \w \W \< \> \1 \2 \3 \4 \5 \6 \7 \8 \9
and from the metacharacters
\ ^ $ . [ ] | ( ) * + ?
which are the ones with the special powers mentioned above. As you saw in the above
section, if a regular expression contains no metacharacters then it behaves like an
ordinary “find” string in that each character in the regular expression must match
a character in the string being searched. The following table summarizes all
character usage in a regular expression (where a b c are ordinary characters,
m is a metacharacter, r is a regular expression, and d is a digit):
c matches the non-metacharacter c itself
\m matches the literal character m, eg \$ matches the dollar sign.
. matches any single character except newline.
^ matches the beginning of a line or a string.
$ matches the end of a line or a string.
[ abc... ] character class, matches any one of the characters a or b or c etc... .
[^ abc... ] negated character class, matches any character except abc... and newline.
(Ranges of characters may be abbreviated in character classes, as in
[0-9] which matches any digit, [A-Za-z] which matches any letter,
[^0-9] which matches anything but a digit).
\w matches a “word” character, exactly equivalent to [0-9A-Za-z]
\W matches a non-word character, ie [^0-9A-Za-z]
\< matches the beginning of a word.
\> matches the end of a word.
\b matches the beginning or end of a word (a word boundary).
\B matches the boundary (beginning or end) of a set of non-word characters.
\t matches a tab.
\n matches a newline (the Return key).
r1 | r2 alternation: matches either r1 or r2, eg "blue|green"
r1r2 concatenation: matches r1 followed by r2 .
r + matches one or more r 's.
r * matches zero or more r 's. (Note that zero r’s can be anywhere in the text)
r ? matches zero or one r 's.
( r ) grouping: matches r. Parentheses have two distinct uses; to override
default precedence of metacharacter operators, and to tag a subexpression
for subsequent reference.
\1...\9 stand for whatever text the first through ninth set of parentheses currently
match, counting opening parentheses from left to right. Note that if the
pair of parentheses has a + or * or ? operator after it, then all of the
matches are included, eg /(foo)+bar/ applied to "foofoofoobar" will set
\1 to "foofoofoo". To get just the first foo, use /(foo)\1*bar/ - then
\1 is set to "foo". (Perl users note this is the opposite of what
you are used to).
\ddd is interpreted as an octal number, as in C. The digits exclude 8 and 9,
needless to say, and there can be from 1 to 3 digits in the number.
Note that \1 through \7 are interpreted as subexpression tags unless
followed immediately by another octal digit (eg \23 is not tag 2 followed
by a 3, it is the octal number 19 decimal). \8 and \9 are always tags,
since 8 and 9 are not octal numbers. To refer to octal numbers 1 to 7,
use \01 to \07. To follow a tag with a low number (eg \2 followed by 3),
use the octal representation of the number (eg \2\063 -- \063 equals
51 decimal, the ASCII code for 3).
The metacharacters ^ and $ to match the beginning and end of strings, and
\b \B \< \> to match various boundaries don’t actually match any characters;
rather they force alignment to a particular text position. For example,
/\brun\b/ will always match just “run” if it matches anything, but will
not match "runner" or "brunt". By comparison, /\Wrun\W/ won’t match
“runner” or “brunt” either, but it will include any non–word character that
happens to come before or after the word “run”. Normally you won’t want to
include leading or trailing spaces etc in the match.
Parentheses () have the highest precedence, allowing you to override default
precedence when needed. The “repetition” operators * + ? have the next–highest
precedence, followed by concatenation, with alternation having the lowest precedence of
all. For example, in abc*d the * applies only to the c since the repetition operator acts
before concatenation, and in abd|def the | applies to abd and def since concatenation
binds them together into little groups of three before alternation can play.
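The precedence rules can be checked with a BEGIN-only program such as this sketch, where each line prints 1 for a match and 0 for no match. The test strings are arbitrary, and the results in the comments follow from the rules just described.

```awk
BEGIN {
    print ("abccd"   ~ /^abc*d$/)      # 1: the * repeats only the c
    print ("abcabcd" ~ /^abc*d$/)      # 0: the * does not repeat all of "abc"
    print ("abcabcd" ~ /^(abc)*d$/)    # 1: parentheses make the * apply to "abc"
    print ("abdxyz"  ~ /^abd|def$/)    # 1: read as (^abd)|(def$), so ^abd alone suffices
    print ("abdxyz"  ~ /^(abd|def)$/)  # 0: now both anchors apply to each alternative
}
```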
Regular expressions can be used to just locate an instance of a pattern, as in
$0 ~ /extern/
but they can also be used to specify text for replacement, by using the “sub” and
“gsub” functions. Looking ahead just a bit, these functions take a regular expression as
the first argument, the string to use for replacement as the second argument, and the
string to do the search and replace in as the third argument, with $0 used by default if
there is no third argument. “sub” does a single substitution on the text, and “gsub”
does all possible non-overlapping substitutions. Within the replacement strings of
these functions, you can use \1 through \9 to refer to text currently matched by tagged
subexpressions, and the ampersand “&” stands for all of the text that was matched.
To put a plain ampersand in the replacement, use “\&”.
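A minimal sketch of the ampersand at work: the program below wraps every run of digits on each input line in angle brackets. It relies on gsub returning the number of substitutions made, which is standard AWK behavior and is assumed here rather than taken from the text above.

```awk
{
    n = gsub(/[0-9]+/, "<&>")     # no third argument, so $0 is edited in place
    print n " change(s): " $0     # "a1 b22" becomes  2 change(s): a<1> b<22>
}
```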
At this point some considerable exampling usually helps:
The quick brown matches just that, "The quick brown". Note it would match
"The quick brown" in "The quick brownie".
red fox\. matches "red fox." (the period must be escaped for a literal match).
[ \t] matches a single space or tab (that’s a space before the \).
[ \t]+ matches any consecutive run of spaces and tabs in any mix.
[0-9]+ matches an integer (read “one or more digits”)
[+-]?[0-9]+ matches an integer, together with optional preceding sign.
\<[A-Za-z'’-]+\> matches an English word.
houses? matches "house" or "houses".
m(iss)*ippi matches "mippi", "missippi", "mississippi", "missississippi", etc.
ar*g matches "ag", "arg", "arrg", "arrrg", etc.
MyFunction\( matches "MyFunction(".
array\[index\] matches "array[index]".
array\[.+\] matches "array[i]", "array[j]", "array[2*q-1]", etc.
\\([0-7]|[0-7][0-7]) matches "\d" or "\dd" where d is an octal digit.
([^\\]?|(\\\\)+)" (horrors, be brave) matches an unescaped quote or a quote
preceded by an even number of backslashes—in other words
a true quote in C. The backslash is a metacharacter, so matching
one literally requires a backslash before the backslash.
The[ \t]+quick[ \t]+brown matches "The quick brown" with variable spaces and tabs
between the words.
\/\* matches the start of a C comment, "/*". The forward slash is
escaped so that you can place the whole regular expression inside
forward slashes. The escape before '/' would not be needed if you
placed the expression inside quotes, but then you would need two
escapes before the '*', ie "/\\*".
\/\*.*\*\/ matches all of a one–line C comment, "/* - anything - */".
^Z matches a 'Z' at the beginning of a string.
^. matches the first character of a string.
.$ matches the last character of a string.
^.*$ matches any string completely (not much use).
^A..$ matches any string which is three characters long, the first
being an 'A'.
^(A|B).* matches all of any string that begins with 'A' or 'B'.
^[AB].* does likewise.
(\w|_)\w* matches a C term, or integer constant.
((->)|(\.))(mem\b) matches “mem” when it is immediately preceded by “->”
or “.”, and is not the beginning of a longer word. For
replacement purposes in a “sub” or “gsub”, the part
before “mem” is given by \1, and mem itself is \4.
gsub(/((->)|(\.))(mem\b)/, "\1\4ber") will turn “->mem” into “->member”
and “.mem” into “.member” everywhere in the current
input line $0, ignoring things like “remember” or
“->memories”.
gsub(/\bFuncName([ \t]*\()/, "FunctionName\1") will replace “FuncName” by
“FunctionName” everywhere in the current input line
$0, provided it is followed on the same line by an opening
parenthesis, with optional spaces or tabs between the name
and “(”. The match extends from the “F” of
“FuncName” up to and including the “(”, so the “(”
and any intervening white space must be put back into
the replacement string by tagging them in parentheses
and using \1 after “FuncName” to refer to what was
matched by the first set of parentheses in the pattern.
This program prints all input lines containing one-line comments:
/\/\*.*\*\// {print}
(since {print} is the default action, it could be left out).
Within a character class most metacharacters are taken literally. The exceptions are
the escaping backslash \, the negating ^ (only at the beginning), and the range hyphen -
(only between two characters). For example,
[A-Za-z-] matches any single letter or a hyphen (follow it with + to match a whole hyphenated word)
[-A-Za-z] does the same
[\-A-Za-z] also does the same (the '\' is unnecessary but harmless)
^[^^] matches any single character that is not a '^' at the beginning of a string
[\^] matches a '^'.
The toughest metacharacter to remember is the '^' which has three meanings: at the beginning
of a character class it signals a negated character class; outside of a character class it matches
the beginning of a string; and when escaped or not the first character in a character class it
matches a literal '^'.
Regular expressions are “left greedy”; where there could be more than one match in a
string, a regular expression matches the leftmost one, and extends the match as far as
possible. For the implications of this, see the discussion of the “match” function in the
“Built–in string and file functions” section of the next chapter, “Actions”.
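In the meantime, here is a short sketch of what “leftmost, then longest” means in practice. It uses match() together with RSTART and RLENGTH from the built-in variable list, plus the standard substr() function; the test string is invented for the purpose.

```awk
BEGIN {
    s = "xaaay aaaaa"
    if (match(s, /a+/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)   # 2 3 aaa
}
```

The match is the first run of a's, extended as far as it will go, even though a longer run appears later in the string.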
Now that we’re starting to get the hang of things, more examples using the replacement
functions “sub” and “gsub” mentioned above. The format is sub(r,s,t) where r is a
regular expression, s is the replacement string, and t is the string in which the search
and replace is to be done. The contents of t before and after the sub are spelled out below.
using t = "Don’t run that prune over, runt!":
sub(/run/, "fly", t) turns t into "Don’t fly that prune over, runt!"
gsub(/run/, "fly", t) turns t into "Don’t fly that pflye over, flyt!"
gsub(/\brun\b/, "fly", t) turns t into "Don’t fly that prune over, runt!"
gsub(/run/, "t&k", t) turns t into "Don’t trunk that ptrunke over, trunkt!"
using t = "#define FOO 1":
sub(/#define\W+(\w+)\W+([0-9]+)/, "int \1 = \2;",t) turns t into
"int FOO = 1;" (\W+ means one or more non-word characters, \w+
means one or more word characters, [0-9]+ means one or more digits;
two groups are tagged).
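The difference between sub and gsub, and the role of '&', can be checked with a short run. This is a minimal sketch that assumes a standard awk run from a Unix-style shell as a stand-in for hAWK; the helper name demo_replace and the shortened test string are made up for illustration, and the \b, \w and \1 forms are left out because not every awk accepts them.

```shell
# Stand-in for hAWK: any POSIX awk. demo_replace is a hypothetical helper name.
demo_replace() {
    awk 'BEGIN {
        t = "run that prune over, runt!"
        n = sub(/run/, "fly", t)       # leftmost match only
        print n, t
        t = "run that prune over, runt!"
        n = gsub(/run/, "fly", t)      # every match
        print n, t
        t = "run that prune over, runt!"
        n = gsub(/run/, "t&k", t)      # & stands for the matched text
        print n, t
    }'
}
demo_replace
```

Each output line starts with the value returned by the function (the number of substitutions made), followed by the altered string.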
Three programs are supplied to help you do general–purpose listing of matches or
search–and–replace; $MFSLister searches for either plain text or a regular expression
with “Set variables” in the setup dialog, and lists file name/ line number of all
single–line matches to stdout; $MFS_SuperLister does much the same, but finds
matches that span a variable number of lines; and $MFS_SuperReplace does the
ultimate search and replace, matching either plain text or full–blown regular
expressions over a variable number of lines, handling any number of files at once,
documenting the (post–change) locations of all changes to stdout. Heck, it even prints
the fragments of original text before the changes, so that if you mess up you can at least
(manually) undo the damage. (Exercise: write $MFS_Undo_SuperReplace).
Compound patterns
The logical operators ||, &&, and ! can be used to combine simple patterns into compound ones.
These operators function the same as in C, specifically: || is the inclusive–or operator; && is
the and operator; and ! is negation, with evaluation of a compound pattern proceeding only as
far as necessary to determine whether the whole pattern is true or false.
Some examples:
$1 ~ /DATA/ && $2+0 > 0
is true when the first field contains the string "DATA" and the second field is numeric and
greater than zero. If the first field does not contain "DATA" then the second field is not checked.
$1 == "DATA" || $1 == "INFO"
is true when the first field is exactly equal to "DATA" or "INFO". The check for "INFO" is
performed only if the check for "DATA" fails.
$2 != 0 && !($3/$2 > 10 || $3/$2 < 1)
first checks that $2 is not zero, to avoid dividing by zero, and then evaluates to true if
$3 divided by $2 falls in the range 1 to 10.
The ? : operator can be used to choose between two patterns, and is like the same
operator in C. If the first pattern is true then the pattern used for testing is the second
pattern, otherwise it is the third. Only one of the second and third patterns is evaluated.
$2 != 0 ? $3/$2 > 1 : $3 == 0
first checks to see if field 2 is non–zero; if so, the pattern is true if $3/$2 > 1; otherwise,
the overall pattern is true if field 3 is also zero.
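The two guarded-division patterns above can be tried on a few made-up lines. This is a sketch assuming a standard awk run from a Unix-style shell rather than hAWK itself; the helper names in_range and ratio_or_zero and the sample data are hypothetical. The lines with a zero second field show that the division is never attempted for them.

```shell
# Hypothetical helpers; each prints the first field of every matching line.
in_range() {
    awk '$2 != 0 && !($3/$2 > 10 || $3/$2 < 1) { print $1 }'
}
ratio_or_zero() {
    awk '($2 != 0 ? $3/$2 > 1 : $3 == 0) { print $1 }'
}
data='a 2 10
b 0 5
c 4 2
d 1 11
e 0 0'
printf '%s\n' "$data" | in_range        # only line a has a ratio from 1 to 10
printf '%s\n' "$data" | ratio_or_zero   # lines a, d (ratio > 1) and e (both zero)
```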
Range patterns
Range patterns consist of two patterns separated by a comma. Given
pattern1, pattern2
this evaluates to true for the first input line that matches pattern1, and thereafter is
true up to and including the first line encountered that contains pattern2. Both patterns
may occur on the same line, in which case the range pattern is true for just the one
line (and a check for pattern1 begins again on the next line). If the second pattern is
never seen, matching continues to the end of all input. Range patterns, as with BEGIN
and END, cannot be compounded with other patterns to form more complicated patterns.
Note that pattern2 specifies the last line to be matched, for example
NR == 1, NR == 2
matches the first and second lines of input.
Range patterns are useful only with input that has been well–organised on a line–by–line
basis, with clear signals for the start and end of a group of lines. An ideal case would be
a file with markers dedicated to indicating the start and end of a group, such as
Start 10 11 -23
47 101 96 End
Start 19 23 End etc
in which case your program could analyze groups with
/Start/, /End/ {actions for the group}
but in real life the only way you’ll see an input file like this is if you make it yourself.
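To see exactly which lines a range pattern selects, the Start/End data above can be padded with a few stray lines. This is a sketch assuming a standard awk run from a Unix-style shell as a stand-in for hAWK; the helper name show_groups and the extra lines are made up for illustration.

```shell
# Hypothetical helper: prints the line number and text of every line inside a range.
show_groups() {
    awk '/Start/, /End/ { print NR ": " $0 }'
}
printf 'junk\nStart 10 11 -23\n47 101 96 End\nnoise\nStart 19 23 End\ntail\n' | show_groups
```

Lines 2 and 3 form one group, line 5 is a group by itself because both markers are on the same line, and the stray lines are ignored.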
Summary of patterns
A list of beasts in the pattern zoo (regex stands for regular expression, pat
stands for pattern, str stands for string variable):
Pattern Example
---------------- -------------------------------
BEGIN BEGIN blocks are done before all input
END END blocks are done after all input
/regex/ /Mary( \t)+had/
str ~ /regex/ (or !~) $1 ~ /(\-)?[0-9]+/
str ~ "regex" (or !~) $1 ~ "(\\-)?[0-9]+"
relational expression NF > 4
pattern && pattern FNR == 1 && /File title:/
pattern || pattern /Vermont/ || /Maine/
pattern ? pattern : pattern $3 != 0 ? $2 / $3 > 25 : $2 < 0
( pattern ) - see next line
! pattern !($0 == "" || $0 ~/^The end$/)
pattern1 , pattern2 FNR == 5, FNR == 8
There’s no substitute for doing it yourself. Here’s a small program that will let
you try out your own patterns—it’s not saved separately, so select it and save it
into your “hAWK programs” folder under a name that begins with a '$', such as
“$PatternTester”. Substitute your test pattern for the word “pattern” below
when you have one to try out. Grab some example input from somewhere, paste it
into a new window, call hAWK, select “$PatternTester”, and run it with the
“All of front text” input option, leaving “Show stdout” with a check mark. All input
lines that match your pattern will produce a comment in stdout, which will be shown
to you after the run.
#A small program for testing patterns.
#Replace the word "pattern" on the next line with your pattern.
pattern {
print "Pattern matched input line", NR, "which was:"
print "\t", $0
++n
}
END { if (n > 0)
print "Total matches:", n;
else
print "No matches were found.";
}#the end
-------
Actions
-------
Introduction
Virtually everything you have learned about patterns can be carried over to actions for
constructing conditional tests (excepting BEGIN, END, range patterns, and default
behaviour when parts of a pattern are left out). For example,
$1 ~ /NUM/ {if ($2 ~ /RANGE/)
--then the first field contained "NUM", and the
second field contained "RANGE"--
}
or
FNR < 10 {if (FNR == 1)
print "First line of current file is:", $0
else if (FNR == 2)
print "Second line of current file is:", $0
etc
}
which demonstrate that it is possible to place a general test in the pattern, and then proceed
with more specific tests in the action statements.
You’ve probably noticed that hAWK expressions strongly resemble C code, and this is
no accident—leaving aside the advanced machinery of C dealing with pointers, structs
and unions, and multi–dimensional arrays, what you know about writing C carries over
to hAWK. There are some omissions, such as no need to declare variables, no prototypes
for functions, no brackets around the arguments of some built–in functions (print,
getline) that require a bit of adjustment. And there are some additions (most notably
regular expressions, built–in string functions such as “match”, and the way input is
automatically retrieved to $0) which require a bit of work to grasp comfortably. But
regular expressions were the only tough part; the rest is easy by comparison, and
you should count your hAWK diploma as a foregone conclusion if you keep going here.
You have met variables, including built–in and field variables, and the operators which
are especially useful for building patterns: the sections below will round out the list of
operators, describe hAWK’s built–in functions dealing with numbers and strings, and
introduce control–flow statements (if, for, while, etc) which allow you to choose between
alternatives or repeatedly execute statements.
Knowledge of C will speed up learning hAWK. However, hAWK is simpler than C, so if you
are new to C as well you should find that learning hAWK will speed up learning C. Whatever
your background, you should regard hAWK itself as an essential part of this manual; if you
have a small problem, or an idea that wants polishing, whip up a little hAWK program and
give it a try.
A preview of “print”
Ultimately, your hAWK program will produce output. The “print” statement will answer
most all of your output needs, being simpler in form than the “printf” function which has
more sophisticated formatting. Pass “print” a list of variables or constants separated by
commas, and they will be printed to stdout, with the commas replaced by the output field
separator (the built–in variable OFS, by default a blank). The contents of ORS (the output
record separator, by default a newline) will be appended to the end of what was printed.
For example:
this one–line program
{print FNR, $0}
will duplicate all input to stdout, adding a line number to the beginning of each line. The
number will be reset to 1 at the beginning of each input file, but all input files will be
concatenated together in stdout.
{print $1}
will print just the first field of each input line to stdout.
$1 ~ /extern/ {print FILENAME, FNR}
will print the (full path) file name and line number where the word “extern”
was seen.
Variables and strings may be concatenated together by using a space instead of a comma
between them, for example
a = "Sesqui"
b = "alien"
print a "ped" b
which produces "Sesquipedalien" (note there is no built–in spelling checker). Concatenation
is slower than using commas to separate the items for “print”, best used only if you must
avoid having the OFS space between two items. Note that
print a, "ped", b
produces "Sesqui ped alien".
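The effect of commas, concatenation, and a changed OFS can be compared side by side. This is a minimal sketch assuming a standard awk run from a Unix-style shell rather than hAWK; the helper name ofs_demo is hypothetical.

```shell
# Hypothetical helper showing how print joins its arguments.
ofs_demo() {
    awk 'BEGIN {
        a = "Sesqui"; b = "alien"
        print a, "ped", b        # commas are replaced by OFS (a blank by default)
        print a "ped" b          # concatenation: nothing between the items
        OFS = "-"
        print a, "ped", b        # same statement, new separator
    }'
}
ofs_demo
```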
More on “print” later, but for the time being if you find yourself wondering what an
operator or function produces—assign the result to a variable and print it out.
Expression operators
With the exception of string concatenation and the matching operators, the operators in
hAWK are the same as C operators. They apply to both numbers and strings wherever it
is logical, and all numbers are floating point numbers. Note that if a variable is
assigned an integer value then it can be treated as an integer—for example, if
i = 1 at some point, then later the test
if (i == 1) will evaluate to true (non-zero), with no failure due to obscure
floating point rounding trouble.
The operators in hAWK, in order of increasing precedence, are:
--------------------------------------------
= += -= *= /= %= ^=
Assignment. Both absolute assignment ( var = value ) and operator-assignment (the
other forms) are supported. “a += b” is equivalent to “a = a + b”.
?: The C conditional expression. This has the form
expr1 ? expr2 : expr3
If expr1 is true, the value of the expression is expr2 , otherwise it is expr3 . Only one
of expr2 and expr3 is evaluated.
|| logical OR. In “a || b” if a is true then b is not evaluated.
&& logical AND. In “a && b” if a is false then b is not evaluated.
~ !~ regular expression match, negated match. See “String-matching patterns”.
< <= > >= != ==
the regular relational operators. Note especially that strings can be
compared, eg if ($3 == "cat"). In “a <= b” or the like, if both
arguments are numbers the comparison is done numerically,
otherwise they are compared as ASCII strings.
blank string concatenation; if a = "John" and b = "Henry" then
c = a b; produces c = "JohnHenry".
+ - addition and subtraction.
* / % multiplication, division, and modulus ( x%y produces the remainder of
x divided by y, equivalent to x - int(x/y)*y ).
+ - ! unary plus, unary minus, and logical negation.
^ exponentiation.
++ -- increment and decrement, both prefix and postfix.
$ field reference. $0 is the entire current record, $1 the first field,
and $NF the last field. Fields may be changed or added.
Some examples:
{lines[++n] = $0}
accumulates all input lines to the array lines[]. The variable “n” starts out as 0, so
the “++n” produces 1 as the first index. At the end of input “n” is equal to the number
of input lines seen, so
END {print lines[1]; print lines[n]}
would print out the first and last lines of input.
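Once the lines are stored in an array they can be revisited in any order. The following sketch assumes a standard awk run from a Unix-style shell as a stand-in for hAWK; the helper name reverse_lines and the three input lines are made up for illustration.

```shell
# Hypothetical helper: stores all input, then reports on it in the END block.
reverse_lines() {
    awk '{ lines[++n] = $0 }        # n is 0 at first, so the first index is 1
         END { print "first:", lines[1]
               print "last:", lines[n]
               for (i = n; i >= 1; --i) print i, lines[i] }'
}
printf 'alpha\nbeta\ngamma\n' | reverse_lines
```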
Built–in numeric functions
hAWK has the following pre-defined arithmetic functions, with x and y as
arbitrary expressions:
atan2( y , x ) returns the arctangent of y/x in radians.
cos( x ) returns the cosine of x in radians.
exp( x ) the exponential function "e to the x"
int( x ) truncates to integer (eg int(7.325) gives 7); to round,
use int(x + .5).
log( x ) the natural logarithm function, base e. For log base 10, use
log(x)/log(10).
rand() returns a random number, 0 <= rand() < 1.
sin( x ) returns the sine of x in radians.
sqrt( x ) the square root function.
srand( x ) use x as a new seed for the random number generator. If no
x is provided, the time of day will be used. The return value
is the previous seed for the random number generator.
Some examples:
atan2(0,-1) gives π, and exp(1) gives e.
theta = atan2(y,x)
r = sqrt(x*x + y*y)
converts rectangular x,y to polar r,theta.
int(max * rand())
produces a random integer from 0 to max-1, inclusive.
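The numeric examples above can be run as one small program. This is a sketch assuming a standard awk run from a Unix-style shell rather than hAWK itself; the helper name numeric_demo and the chosen values (x = 3, y = 4, max = 6) are made up for illustration.

```shell
# Hypothetical helper exercising sqrt, atan2, int and rand.
numeric_demo() {
    awk 'BEGIN {
        x = 3; y = 4
        r = sqrt(x*x + y*y)              # rectangular to polar
        theta = atan2(y, x)
        printf "r=%d theta=%.4f\n", r, theta
        printf "pi=%.5f\n", atan2(0, -1)
        print int(7.325), int(7.5 + .5)  # truncation, then rounding
        max = 6; ok = 1
        for (i = 0; i < 1000; ++i) {     # every draw should lie in 0 .. max-1
            k = int(max * rand())
            if (k < 0 || k > max - 1) ok = 0
        }
        print (ok ? "all in range" : "out of range")
    }'
}
numeric_demo
```

Note that the int(x + .5) rounding trick is shown here with a positive value only.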
Built–in string and file functions
There is only one string operator, the concatenation operator, invoked when two variables
or constants are separated by a space. Other useful string manipulations in hAWK are
carried out by built–in functions. In the following table, r is a regular expression,
s and t are strings, a and b are arrays, and i and n are integers.
gsub(r, s, t) for each substring matching the regular expression r in
the string t , substitutes the string s , and returns the
number of substitutions. If t is not supplied, uses $0 .
index( s , t ) returns the index of the string t in the string s,
or 0 if t is not present.
length( s ) returns the length of the string s .
match( s , r ) returns the position in s where the regular expression r
occurs, or 0 if r is not present, and sets the values of
RSTART and RLENGTH .
split(s, a, r) splits the string s into the array a on the regular
expression r , and returns the number of fields. If r is
omitted, FS is used instead.
sprintf( fmt , expr-list ) prints expr-list according to fmt , and returns the
resulting string. See the discussion of “printf” for details.
sub(r, s,t) this is just like gsub , but only the leftmost matching
substring is replaced. Returns number of substitutions.
substr(s, i, n) returns the n-character substring of s starting at i . If n
is omitted, the rest of s is used.
tolower( s ) returns a copy of the string s , with all the uppercase
characters in s translated to their corresponding
lowercase counterparts. Non-alphabetic characters are
left unchanged.
toupper( s ) returns a copy of the string s , with all the lowercase
characters in s translated to their corresponding
uppercase counterparts. Non-alphabetic characters are
left unchanged.
lookup( s ) returns integer–coded C type of s (s should be a word).
(At present this function is supported by: EnterAct.
Types are taken from whatever project is open at the
time.) See “$LookupTest” or “$XRef” for an example.
Type integer returned
---- ------------
defined constant or macro 1
file–scope variable 2
function 4
enum constant 8
typedef 16
struct tag 32
union tag 64
enum tag 128
other 0
sort(a,b,s) produces an index in the array “b” that can be used to access
the elements of “a” in sorted order. The string “s” specifies the
kind of sort; "a" for ASCII, "n" for numeric, "d" for dictionary
order, and "ra", "rn", "rd" for reverse of the same. Returns the
number of elements in the array “b”, which is indexed numerically
from 1 upwards. The elements of “b” are the indexes of “a” in
sorted order provided “b” is accessed in the sequence b[1], b[2],
b[3] etc. Typical use is
maxIndex = sort(a, b, "d")
for (i = 1; i <= maxIndex; ++i)
print a[b[i]]
which will print the elements of a in sorted dictionary order.
See “$WordFrequency” and “$XRef_Full” for examples, and
“$SortTest_Nums” for a simple numeric example.
time( ) returns the current time, eg "Sunday, October 27, 1991 09:03:30 AM"
—note this is the time when the function is called, down to the second,
whereas the TIME variable holds the time at which your program run
starts, down to the minute. See “$TIME” for an example.
prompt( s ) displays an OK/Cancel dialog. The string “s” appears at the top of the
dialog, and you can type in a string in an edit field. Returns what you
type in, as though it was a string constant. Both the string “s” and what
you type in are limited to 255 characters. For an example of usage
see “$PromptTest” and “$YoungMath”. Typical use is
x = prompt("Enter the number of lines to print:")
if (x+0 > 0) {
while (getline lne > 0 && ++i <= x) print lne }
If you cancel the dialog or hit <Return> without typing in any text,
prompt returns the null string "".
NOTE this function is only useful if hAWK is called up in the “immediate”
mode (typically hold down the <Shift> key when selecting “hAWK”). In
“concurrent” mode, “prompt()” does nothing but return the empty
string "" without displaying a dialog.
progress(s) displays the string “s” in a dialog on your screen (the message stays
on the screen). You can change the message with another “progress”
call. “progress” returns the number of times it has been called, and
the dialog goes away by itself at the end of your program run. For a
test sample, see “$ProgressTest”.
NOTE this function is only useful if hAWK is called up in the “immediate”
mode (typically hold down the <Shift> key when selecting “hAWK”). In
“concurrent” mode, “progress()” does nothing but return 0.
--and added for hAWK version 2 (mainly file functions):
Note in the functions below where a file or directory name is required it must
be a full pathname, of the form “disk:folder1:folder2:...:folderN:filename”
for a file, or “disk:folder1:...:folderN” or “disk:folder1:...:folderN:”
for a directory (the second version has a colon at the end). For a disk name,
use “disk:” rather than “disk”.
beep( n ) does a SysBeep(n); if the duration "n" is <= 0, the menu bar will
flash instead. Durations of 0,1,2,5 work best.
copy( s, t ) copies the file named “s” to the file named “t”. Both file names
must be full pathnames (disk:folder:...folder:filename). Either
the location or name or both can be changed. If file “t” already
exists, it must be closed and unlocked. Both creator and type are
preserved, and the resource fork is copied as well as the data
fork. Any kind of file can be copied. To move or rename a file, use
if (copy(s,t)) remove(s)
(this is an efficient way to move a file, but there is a separate
rename() function). NOTE that t's folders will be created if needed.
Returns 1 if successful, 0 if the copy could not be done.
exists( s ) returns 1 if the file named “s” exists, 0 if it does not. Any kind
of file can be tested.
fdate( s ) returns date/time of last modification of file named “s”, format
“yr:mo:day:hr:min:sec” where yr is 4 digits, and the rest are 2
(eg always 01 rather than just 1). The length of the string is
always 19 (or 0 if no date could be extracted) and the colons
and digits always occupy the same positions.
fsize( s ) returns size in bytes of the data fork only of the file named “s”
getclip( n ) returns the calling application’s current clipboard text, up to
a maximum of the first “n” bytes. Use n = 0 or omit it entirely
if you want the entire clipboard. For example, if the current
clip is “Some text here” then getclip(6) returns “Some t”
whereas getclip(0) or getclip() returns the entire clip. At
present this function is supported by: EnterAct.
putclip( s ) replaces the calling application’s (private) clipboard with
the string “s”. Note that other applications won’t see the
change until you switch out of the calling app. The length
of s is limited to 32,767 characters (as are all hAWK strings).
See the “$Clip...” functions in the “hAWK programs” folder
for examples using getclip/putclip. Supported by: EnterAct.
list( s, a ) given file or directory full pathname in “s”, produces list of
full pathnames for all TEXT files in the directory (either the
directory named or the directory holding the file), as elements
indexed 1,2,3... in the array “a”. Note subdirectories are also
excluded. Returns the number of files in the list.
nested( s, a ) given a file full pathname in “s”, generates list of full pathnames
for directories at the same level ("sibling folders"); given directory
name, generates list of subdirectories at the top level in the named
directory (“offspring folders”). The list is returned as elements
indexed 1,2,3... in the array “a”. In other words, the same as
“list” but for folders rather than TEXT files. Note neither “list”
nor “nested” look beneath the top level of the folder in question.
Returns the number of directories in the list.
remove( s ) deletes the file named “s”, provided it is closed and unlocked. Use
with caution, this is not undoable unless you get lucky using your
favourite file recovery tool. Returns 1 if the file was deleted,
0 otherwise. Use with caution!
rename( s, t ) takes the file with full pathname “s”, and renames it “t”. The
new name “t” can be a full pathname, or just the new file name
proper, as in
rename("Disk:dir1:aardvark", "Disk:dir1:fruitbat")
or equivalently
rename("Disk:dir1:aardvark", "fruitbat")
This function works only with files, not directories or volumes,
returning 1 if the rename was carried out, 0 if not.
The version 1 functions form the heart of hAWK, and you will find examples of usage of
one or more of these in nearly all the sample programs. The version 2 functions have
more limited scope, but keep them in mind when you need to wrestle with files.
Within the replacement string 's' of gsub(r,s,t) and sub(r,s,t), a '&' is taken to stand
for the entire string of text that was matched by the regular expression 'r'. For example,
gsub(/cat/, "&s", t) with t = "cat and dogs" produces t = "cats and dogs" after
the substitution. Use “\&” if you want a literal '&' in the replacement string.
Using sub, gsub, and match effectively is entirely a matter of becoming comfortable
with regular expressions (practice makes perfect). The regular expressions in these
functions can be static, as in
if (match($0, /struct/))...
or dynamic (the contents of a variable) as in
wordStart = "(^|[^a-zA-Z'-])" #beginning of string or non–word character
optLetters = "[a-zA-Z'-]*" #zero or more word characters
findString = wordStart "(A|a)ct" optLetters
if (match($0, findString))...
(which matches eg “act”, “Actor” but not “tract”, or “Reactor”). It’s sometimes
handy to use the “Set variables” dialog to set the string to be found (see $MFSLister,
for example), or you can even read the string to be found out of the input itself, as in
FNR == 1{find = $1; rep = $2}
FNR > 1{gsub(find, rep)}
which sets the strings for find and replace from the first two fields on the first line
of input, and then uses them to do replacement on all subsequent lines.
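Here is that idea as a complete run, with a print added so the altered lines can be seen. This is a minimal sketch assuming a standard awk run from a Unix-style shell as a stand-in for hAWK; the helper name first_line_rules and the sample input are hypothetical.

```shell
# Hypothetical helper: line 1 supplies the find and replace strings for the rest.
first_line_rules() {
    awk 'FNR == 1 { find = $1; rep = $2 }
         FNR > 1  { gsub(find, rep); print }'   # find is used as a dynamic regular expression
}
printf 'colour color\nThe colour of colours\nplain line\n' | first_line_rules
```

Because the find string is treated as a regular expression, any metacharacters it contains keep their special meaning.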
A miscellany:
{gsub(/->resourceid/, "->resourceID")
gsub(/\.resourceid/, ".resourceID")
print
}
copies all input to stdout, changing “resourceid” to “resourceID” when it appears
as a member name (note $0 is used in the gsub by default).
gsub("\n", "\n", multi)
returns a count of the number of returns (newlines) in the string “multi”.
gsub(/boo/, "&&s") turns “boo” into “booboos” everywhere in $0.
index("abcdef", "cd") returns 3.
match("abcdef", /cd/) returns 3, and sets RSTART to 3, RLENGTH to 2.
z = split("hour:minute:second", arr, ":") assigns 3 to z, with
arr[1] = "hour", arr[2] = "minute", arr[3] = "second".
Given str = "Now is the time",
substr(str,1,3) returns "Now", substr(str,8) returns "the time".
More examples follow the next section.
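The return values quoted in the miscellany can be confirmed with a program that needs no input. This is a sketch assuming a standard awk run from a Unix-style shell rather than hAWK; the helper name string_demo is made up for illustration.

```shell
# Hypothetical helper exercising index, match, split, substr, gsub, toupper and length.
string_demo() {
    awk 'BEGIN {
        print index("abcdef", "cd")
        pos = match("abcdef", /cd/)            # also sets RSTART and RLENGTH
        print pos, RSTART, RLENGTH
        n = split("hour:minute:second", arr, ":")
        print n, arr[1], arr[3]
        str = "Now is the time"
        print substr(str, 1, 3) "|" substr(str, 8)
        t = "cat and dogs"
        gsub(/cat/, "&s", t)                   # & is the matched text
        print t
        print toupper("hAWK"), length("hAWK")
    }'
}
string_demo
```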
Control-flow statements
Statements in hAWK may be grouped with curly braces, one can execute statements only
when a certain condition is met, and statements can be repeatedly executed according to
the value of some condition. While hAWK does not have a “goto”, it does allow you to
jump back to the top of your pattern–action statements with “next”, or jump to your
END statements on the way out the door with “exit”.
In the following list of control statements, any instance of “statement” can be replaced
by a group of statements enclosed in curly braces {}:
{ statements }
Simple grouping of several statements together, so that conditional or repeated
execution can be applied to the group.
if (condition) statement1 [ else statement2 ]
If the condition evaluates to true then statement1 is carried out; the “else”
clause is optional, and statement2 will be executed if the condition is false.
while (condition) statement
The condition is first evaluated, and if it is false then the statement is skipped. If
it is true then the statement is executed; the condition is again evaluated, and the
statements again executed if the condition is true, and this process continues until
the condition is false. Note that if the condition is false the first time then the statement
will not be executed at all. “while” loops are affected by break and continue statements.
do statement while (condition)
The statement is always executed at least once; then the condition is evaluated, and if it
is true then the statement is executed again. This process continues until the condition
is false. Unlike the “while” loop, the “do” loop always executes its statement at least
once.
for (expr1; expr2; expr3) statement
eg “for (i = 1; i <= 6; ++i) {print i}”
Mnemonically, “for it’s (a jolly good fellow)” helps: in “it’s”, the “i” stands for
initialization, the “t” for “test”, and the “s” for “step”. expr1 is the initialization,
executed only once, just before the “for” loop proper is entered. Next
expr2, the test, is evaluated, and if it is true then the statement is executed, otherwise
the for loop ends and control passes to the next statement beyond it. If the statement is
executed then expr3, the step, is carried out, and then it’s back to the top of the loop
—no more initialization, but the sequence test, execute, step, continues until the test
produces false.
for (var in array) statement
Indexes for the array are retrieved one–by–one to the variable “var”, though not
in a readily predictable order, and the statement is executed for each index.
break
For use only among the statements that make up the body of a while, do, or for loop.
Usually found in the form “if (condition) break;”, when the break is executed then
control immediately passes to the next statement after the loop.
continue
Also for use only in a while, do, or for loop, and also usually executed only when
the condition of some if–statement is true. When encountered, control passes to the
very end of the statements making up the body of the loop, and the next iteration of
the loop begins.
next
Stop processing the current input record. The next input record is read and
processing starts over with the first pattern in the hAWK program. If the end of
the input data is reached, the END block(s), if any, are executed.
exit [ expression ]
In an END action, exit truly causes the hAWK program to terminate. Anywhere
else, the exit statement causes the program to jump to the END actions, and only
if none are present does the program immediately terminate. The “expression”
is provided for compatibility with standard AWK programs, and won’t be of any
use to you.
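The four jump statements in the list above (next, exit, continue, break) can be seen working together in one short program. This is a sketch assuming a standard awk run from a Unix-style shell as a stand-in for hAWK; the helper name flow_demo and the input conventions (a leading '#' for comments, "-" as a placeholder, "stop" as an end marker) are made up for illustration.

```shell
# Hypothetical helper: sums the numbers on each line, with four ways of skipping work.
flow_demo() {
    awk '/^#/ { next }                    # comment line: go straight to the next record
         $1 == "stop" { exit }            # end marker: jump to the END action
         { for (i = 1; i <= NF; ++i) {
               if ($i == "-") continue    # placeholder: skip this field only
               if ($i < 0) break          # negative number: abandon the rest of the line
               sum += $i
           } }
         END { print "sum:", sum }'
}
printf '# header\n1 2 -\n3 - 4\n5 -1 100\nstop\n9 9\n' | flow_demo
```

The total is 1 + 2 + 3 + 4 + 5; the 100 is cut off by the break, and the line after "stop" is never read.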
Here’s a small sample program, with lots of potential if you’re looking for
a first hAWK project:
BEGIN { find = "(^|[^@])([A-Z][A-Z]+)" #note \1 \2 grouping by ()()
rep["CA"] = "California"
rep["HYPO"] = "hypobetalipoproteinemia"
rep["RE"] = "regular expression"
#...etc... note just a part of a word is OK
}
{loopCount = 0;
while (match($0, find) && loopCount++ < 50)
{
acronym= substr($0, RSTART, RLENGTH)
gsub(/[^A-Z@#]/, "", acronym) #or sub(find, "\2", acronym)
if (acronym in rep)
sub(find, "\1" rep[acronym])#replace acronym by expansion
else
sub(find, "\1@#@\2")#stick '@#@' in front of unknown acronym
}
if (loopCount >= 50)
{
print "The acronym", acronym, "is looping forever." ; exit
}
gsub(/@#@/, "")#trim the protector by replacing it with null string
print #print the altered line to stdout
}
- builds a glossary at the beginning, and then expands any acronyms in the input for
which there is an entry in the array “rep”, sending the expanded version to stdout.
The “sub” and “match” both match the leftmost longest string of uppercase letters,
and replacement is done one match at a time until the line contains no more matches.
To avoid an endless loop, finds for which there is no expansion have a '@#@' stuck in
front of them. This '@#@' is trimmed away after.
A silly example:
#print arr[] elements with index, according to value of “sequence” string:
#use as much variety as possible, to avoid boredom. If sequence is numeric,
#“arrMax” holds the maximum index.
if (sequence == "up")#Numeric increasing index
{
i = 1;
do
{
print i, arr[i++]
} while (i <= arrMax);
}
else if (sequence == "down")#Numeric decreasing index
{
i = arrMax;
while (i >= 1)
{
print i, arr[i]
--i
}
}
else if (sequence == "associative")#Arbitrary indexes
{
for (i in arr)
{
print i, arr[i]
}
}
else
{
print sequence, "???!!!!"
print "Repeat after me, ten times:"
for (i = 1; i <= 10; ++i)
print "I will proofread my programs."
exit
}
Virtually all of the sample programs in the “hAWK programs” folder illustrate
control–flow statements.
Empty statements
The empty statement, which does nothing at all, is denoted by a semicolon. Loops
require a body of some sort, and if you wish no statements to be executed in the
body of the loop then just use a single semicolon for the body. More rarely, an
empty statement is useful as the statement for an “if” statement.
------------------
User-defined functions
------------------
Functions in hAWK take the form:
"function" name(parameter1, parameter2,... local1, local2...)
{
statements
}
They are executed when called from within an action statement (or as part of a pattern).
hAWK function definitions begin with the keyword “function”, and no return type is
declared, though a value may optionally be returned. Local variables are listed after the
parameters for the function, more to simplify the grammar of the language than
anything else. Scalar parameters are passed by value (ie a local copy is made for the
function, and the original variable in the function call is not touched by the function)
whereas array parameters are passed by reference (the parameter array name refers
to the same array that is provided as the argument). Function definitions must be placed
at the top level of your program outside any pattern–action blocks, and you generally end
up with a readable program if you put all of your function definitions at the end of your
program.
Here’s a typical function:
function Swap(a, i, j,    temp)
{
temp = a[i]
a[i] = a[j]
a[j] = temp
}
When called, it appears for example as
arr[1] = 7; arr[4] = 9; Swap(arr, 1, 4)
which results in arr[1] = 9, arr[4] = 7. Note that the “temp” variable is intended for
use only within the Swap function, and is a local variable rather than a parameter of
the function.
Local variables are initialized to 0 and "" each time the function is called. No space should
be put between the function name and the '(' of the argument list when calling one of
your own functions, to avoid invoking the simple–minded concatenation operator.
Functions may return an expression, as in
function SumArraySquared(a,    i, sum)
{
for (i in a) #unlike C, array size need not be known separately
sum += a[i]#note sum is local, automatically inited to zero
return sum*sum
}
or
function StringUpTo(str, upto)
{
return substr(str, 1, index(str, upto) - 1)
}
(eg StringUpTo("This is: a test", ":") would return "This is").
Some details about functions:
Newlines are optional after the left curly brace of the function body and before the
closing right brace.
Functions may call each other and may be recursive.
The word func may be used in place of function. For tired typers only.
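The rules in this chapter for scalar parameters, array parameters, and local variables can be checked with a short program. This is a minimal sketch assuming a standard awk run from a Unix-style shell rather than hAWK; the names func_demo, bump, sum_squares, tally and v are all hypothetical.

```shell
# Hypothetical helper contrasting by-value scalars, by-reference arrays, and locals.
func_demo() {
    awk 'function bump(n, arr) {      # n is a copy; arr is the caller array itself
             n++
             arr["count"]++
             return n
         }
         function sum_squares(a,    i, s) {   # i and s are locals, reset on each call
             for (i in a) s += a[i] * a[i]
             return s
         }
         BEGIN {
             x = 1; tally["count"] = 1
             y = bump(x, tally)
             print x, y, tally["count"]       # x is unchanged, the array element is not
             i = "untouched"
             v[1] = 1; v[2] = 2; v[3] = 3
             print sum_squares(v), i          # the global i survives the call
         }'
}
func_demo
```

Listing i among the locals is what keeps the loop inside sum_squares from overwriting the global variable of the same name.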
-------
Output
-------
The “print” statement
“print” sends simply–formatted strings to a file, stdout by default. The expressions
supplied to the print statement are separated from one another by commas, and may
also be entirely surrounded by parentheses. The variations are
print
print expression1, expression2, ..., expressionN
print (expression1, expression2, ..., expressionN)
A “print” with no expressions is an abbreviation for
print $0
Each expression is converted to a string and printed in turn, with each comma being
replaced by the built–in variable OFS, by default a single blank. Each print statement
is terminated with the built–in ORS, by default a newline.
The parenthesized version of “print” is necessary if relational operators are present
in the expressions, since the '>' operator can mean “greater than” or “redirect output
to the file...”—see “Output into files” below.
The print statement is used in virtually every sample program provided, and the
more–sophisticated “printf” is seldom seen since fancy formatting is not often needed.
Some common print statements are
print "" #prints just a blank line
print names[z], FNR #documents location of something by printing file name and line
(search this file from the top for “names[z]” if you missed it)
The “printf” statement
This statement also has a parenthesized and unparenthesized form,
printf format, expression1, expression2, ..., expressionN
printf(format, expression1, expression2, ..., expressionN)
and, as with “print”, the parentheses are needed only if a relational operator
is contained in one of the expressions. The “format” argument is interpreted
as a string, and may contain either literal text to be printed or format
specifications for strings or numbers to be printed. Format specs are indicated
in the format string by a '%', and there should be one expression following the
format for each format specification—eg if you specify that a string, a number,
and a string be printed, then you list the string, number, and string after the
format, in the same order, separated by commas.
The hAWK versions of the printf and sprintf functions accept the following
conversion specification formats, entirely borrowed from C:
%c an ASCII character. If the argument used for %c is numeric, it is treated as
a character and printed. Otherwise, the argument is assumed to be a string,
and only the first character of that string is printed.
%d a decimal number (the integer part).
%i just like %d .
%e a floating point number of the form [-]d.ddddddE[+-]dd .
%f a floating point number of the form [-]ddd.dddddd .
%g use e or f conversion, whichever is shorter, with nonsignificant zeros
suppressed.
%o an unsigned octal number (again, an integer).
%s a character string.
%x an unsigned hexadecimal number (an integer).
%X like %x , but using ABCDEF instead of abcdef .
%% a single % character; no argument is converted.
There are optional, additional parameters that may lie between the % and the control
letter (also from C):
- the expression should be left justified within its field (note if the '-'
is absent then the expression is right justified)
width the field should be padded to this width. If the number has a leading
zero, then the field will be padded with zeros. Otherwise it is padded
with blanks.
.prec a number indicating the maximum width of strings or digits to the right
of the decimal point.
For example, %-23.14s prints strings in a field 23 characters wide, left justified,
printing at most 14 characters from the string. And %8.4f will print a floating point
number in a field 8 characters wide, right justified, with 4 digits to the right of the
decimal point.
The dynamic width and prec capabilities of the C library printf routines are not
supported. However, they may be simulated by using the hAWK concatenation operation
to build up a format specification dynamically.
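The conversions and modifiers listed above can be tried out with sprintf, which
takes the same format strings as printf. This is a sketch in standard AWK, not
verified under hAWK itself; Pad is a made-up helper showing a width chosen at run
time, and the brackets are there only to make the padding visible.

```awk
# Build a right-justified field whose width is chosen at run time,
# by concatenating the width into the format string.
function Pad(str, w)
{
    return sprintf("%" w "s", str)
}

BEGIN {
    print "[" sprintf("%c", 65) "]"               # [A]   numeric argument taken as a character code
    print "[" sprintf("%c", "hello") "]"          # [h]   string argument: first character only
    print "[" sprintf("%d", 3.9) "]"              # [3]   integer part
    print "[" sprintf("%05d", 42) "]"             # [00042]
    print "[" sprintf("%x", 255) "]"              # [ff]
    print "[" sprintf("%-8.3s", "abcdef") "]"     # [abc     ]
    print "[" sprintf("%8.4f", 3.14159265) "]"    # [  3.1416]
    print "[" Pad("ab", 5) "]"                    # [   ab]
}
```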
Some examples:
“print var” always appends the value of ORS (by default a newline); to avoid this, use
printf("%s ", var)
and when a newline is needed, supply one yourself with something like
print "" or printf("%s\n", var).
Given strings of variable width in fields $1 and $2, reformat to print these strings
right–justified in two nicely–lined–up columns:
{ one[++n] = $1
two[n] = $2
if (w1 < length($1))
w1 = length($1)
if (w2 < length($2))
w2 = length($2)
}
END {w1 += 2; w2 += 2;#a couple of spaces between columns
for (i = 1; i <= n; ++i)
printf "%" w1 "s" "%" w2 "s\n", one[i], two[i]
}
—this illustrates using the hAWK concatenation operation “to build up a format
specification dynamically”; for example, if w1 = 9 and w2 = 15 (after adding 2) then
we get
printf "%9s%15s\n", one[i], two[i]
as the effective printf statement.
Output into files
By default, “print” and “printf” send all of their output to stdout. However, the
redirection operators '>' and '>>' allow you to send output to any text file.
Redirecting output takes one of the forms
print expression–list > outfile
print(expression–list) > outfile
printf format, expression–list > outfile
printf(format, expression–list) > outfile
print > outfile
or any of those with '>>' instead of '>'. The '>' operator will erase the contents of outfile
before beginning to write to it, whereas '>>' will append what is being printed to outfile
without clearing the file first. Both operators open the file “outfile” the first time it
is encountered in the program, and keep it open. The file will be closed for you at the end
of your program, but if you have many files to write to you should close each output file
yourself when you are done with it, with “close(outfile)”.
hAWK deals with full path names only, and the names of all output files must be full path
names if you want the file to end up in a predictable place. Since hAWK is adept at
manipulating strings, and a file name is just a string, you can manufacture file names
and paths within your program to fit most needs. The built–in variable STDPATH contains
the path leading to your stdout file, so concatenating a file name to the end of STDPATH, as
in
outfile = STDPATH "Search Results"
will allow you to write files to the folder containing your stdout file, which is your
THINK C/Drag_on Modules folder if you followed installation suggestions. The simplest way
to concoct the appropriate path name for an arbitrary location on your hard disk(s) is
to run the hAWK program “$EchoFullPathNames”, choosing a text file in the desired
location as the input for the program. This will give you the explicit full path name, eg
Disk:C Projects:Banana INIT:Banana source:In_your_ear.c
from which you can copy the path to use as prefix for output file names, in this case
Disk:C Projects:Banana INIT:Banana source:
(neglect not that last colon!)
As special cases you can use the names "stderr" and "stdout" to redirect output
to your stderr and stdout files, eg
print "Serious interstitial vacuities have been detected" > "stderr"
which will quietly write the message to your stderr file—you won’t be notified
that anything has been written there. Normally there isn’t much use for redirecting
output to "stdout" since it goes there anyway by default.
If your current input file happens to be in the right location for the output you intend to
write (for example, if the output is to be an altered version of the input, saved under a
different name) you can extract the path part of the input name, and tack it on to the
beginning of your output file name to produce the needed full path name with this:
FNR == 1{#at the first line of each input file
outfile = "Results";#a fixed name for this little example, reset for each input file
z = split(FILENAME, names, ":");#fragment the full path into the array “names”
for (i = z-1; i >= 1; --i) #note i = z gives the input file name proper
outfile = names[i] ":" outfile;#put path in front of outfile name
}
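The same path extraction can be packaged as a function, so that it can be reused for
any full path name and not just FILENAME. A minimal sketch in standard AWK, not tried
under hAWK itself; FolderOf is a hypothetical name, and the sample path is the one
used earlier in this section.

```awk
# Return the folder part of a colon-separated full path name,
# including the trailing colon, eg "Disk:Folder:File" gives "Disk:Folder:".
function FolderOf(fullName,    names, z, i, path)
{
    z = split(fullName, names, ":")
    path = ""
    for (i = 1; i < z; ++i)
        path = path names[i] ":"
    return path
}

BEGIN {
    input = "Disk:C Projects:Banana INIT:Banana source:In_your_ear.c"
    outfile = FolderOf(input) "Results"
    print outfile     # Disk:C Projects:Banana INIT:Banana source:Results
}
```

In a real program the call would be FolderOf(FILENAME), made when FNR == 1.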
Can you tell what this program does?
FNR == 1{z = split(FILENAME, names, ":");
outfile = names[z];
if (match(outfile, /[0-9]+\.[cChH]$/) > 0)
{#file name ends in number.c or the like
versNumber = substr(outfile, RSTART, RLENGTH - 2);#just the number
ext = substr(outfile, RSTART + RLENGTH - 2);#the ".c", ".h" or the like
++versNumber;
sub(/[0-9]+\.[cChH]$/, versNumber ext, outfile);
}
else
{
print FILENAME, "does not end in number dot c or h, quitting early"
exit
}
for (i = z-1; i >= 1; --i)
outfile = names[i] ":" outfile
}
{print > outfile}
—among other things, it fills up your disk pretty quick. (See $TabsToSpaces.)
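The file–name step of that program, taken on its own, turns a name ending in a
number and .c or .h into the name with the next number, so that each input file is
copied to a new version. The sketch below isolates that step in a function. It is
written for standard AWK and not tried under hAWK itself; NextVersionName is a
made-up name, and keeping the original extension is an assumption about the intent.

```awk
# "Parser12.c" gives "Parser13.c"; names without number-dot-c/h give "".
function NextVersionName(name,    num, ext)
{
    if (match(name, /[0-9]+\.[cChH]$/) == 0)
        return ""
    num = substr(name, RSTART, RLENGTH - 2) + 1     # the digits, plus one
    ext = substr(name, RSTART + RLENGTH - 2)        # ".c", ".h", ...
    return substr(name, 1, RSTART - 1) num ext
}

BEGIN {
    print NextVersionName("Parser12.c")         # Parser13.c
    print NextVersionName("Tables9.H")          # Tables10.H
    print "[" NextVersionName("ReadMe") "]"     # []
}
```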
Closing files
To close a file named by expr, use
close(expr)
This could be a fairly explicit name, such as
close (STDPATH "Results")
where concatenation is used to create the full name, or it could be simply
close(outfile)
where outfile holds the string that is the full path name for the file being closed.
If you write to a file, then you must close it before subsequently reading from it. More
importantly, there is a limit on the number of files that can be open at once, so if your
program writes to a large or arbitrary number of files it is good policy to close each file
when it is completed. As you will see just below, it is also possible to take input from
an arbitrary file by means of redirection with the “getline” function, and in this case
as well it pays to close a file when you are done with it.
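The write, close, then read sequence looks like this in practice. A minimal sketch in
standard AWK, not tried under hAWK itself; WriteThenCount and the scratch file name
are hypothetical, and under hAWK the name would have to be a full path name such as
STDPATH "Scratch".

```awk
# Write n lines to "file", close it, then read the lines back and count them.
function WriteThenCount(file, n,    i, line, count)
{
    for (i = 1; i <= n; ++i)
        print "line " i > file
    close(file)                  # must close before reading what was written
    count = 0
    while ((getline line < file) > 0)
        ++count
    close(file)                  # done reading: free the slot for another file
    return count
}

BEGIN {
    # Hypothetical scratch file name, created in the current folder.
    print WriteThenCount("hawk_close_demo.tmp", 3)     # 3
}
```

Without the first close(), the read would start while the file is still open for
writing, which is exactly the situation the rule above is meant to prevent.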
------
Input
------
FS, the input field separator
If you leave FS set to its default value of a single space, then any combination of
blanks and tabs will count as the field separator, and as a “bonus” any leading
blanks or tabs will be removed from the first field of each record, though they will
remain in the record itself (ie $1 is trimmed but $0 is not).
FS is slightly odd in that it has two modes of interpretation; when it is a single character
such as FS = ":" then the single literal character (no matter what it is) is taken as the
input field separator, but if the string for FS is longer than a single character it is
interpreted as a regular expression. Here are some commonly–used field separators:
FS = "[ ]" —necessary if you wish the field separator set to a single space, since
FS = " " invokes the default behaviour described above
FS = "[ ,\t]+" —any mix of blanks, commas, and tabs
FS = "\n" —a field is a complete line (see the discussion in the next section).
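The two modes of FS can be explored without any input file by using split(), whose
third argument follows the same rules as FS in standard AWK (assumed here to hold for
hAWK as well; the sketch has not been tried under hAWK itself). CountFields is a
made-up helper.

```awk
# Count the fields in "str" when "fs" is used as the separator.
function CountFields(str, fs,    parts)
{
    return split(str, parts, fs)
}

BEGIN {
    line = "  alpha  beta"
    print CountFields(line, " ")                # 2: default mode, leading blanks dropped
    print CountFields(line, "[ ]")              # 5: every single space separates
    print CountFields("a:b:c", ":")             # 3: single literal character
    print CountFields("a, b,\tc", "[ ,\t]+")    # 3: any mix of blanks, commas, and tabs
}
```

The count of 5 for "[ ]" comes from the empty fields between adjacent spaces, which
is why the default single–space mode is usually what you want for ordinary text.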
RS, the input record separator
In practice RS is either left to its default value of "\n" (ie a record is the same as a line)
or can if needed be set to the null string "", in which case records are separated by one
or more blank lines. The latter corresponds to a simple form of database, with all the
lines of each record grouped together and blank lines between records. With these
multi–line records it is often useful to also set the field separator FS to "\n", so that
a field becomes a complete line.
Alas, these simple conceptions of a record are not often adequate. Narrative text and C
source files require a more flexible approach to input which can be generally stated as
“grab enough input to do the current job, and never mind where the lines end”. Several
solutions are discussed in the “Beyond input records” section of “Advanced
topics”. Don’t skip over the next section on “getline”, though, because it plays a
strong supporting role.
The “getline” function
“getline” is a built–in function that allows you to retrieve input records from the current
input file or from any other file. As you know, the default behaviour of a hAWK program is
to retrieve input from your input files one record at a time, marching through the records
and files from beginning to end. Often, however, one needs to read in a group of lines until
some condition is met, or interrupt regular input to retrieve records from some other file,
and these are the special capabilities that “getline” provides. It can be used in the following
ways:
getline              sets $0 from next input record; sets NF, NR, FNR.
getline < file       sets $0 from next record of file; sets NF.
getline var          sets var from next input record; sets NR, FNR.
getline var < file   sets var from next record of file.
and in all cases “getline” returns 1 if a record was successfully retrieved, 0 if the end of file
was encountered, and -1 if some problem occurred, such as failure to find the file.
The effect of “getline” by itself is to dump the current string in $0 and replace it with
the next input record, setting all the usual built–in variables. Program execution then
continues with the statement following “getline”. By comparison, the “next” statement
does everything that “getline” by itself does, but in addition processing starts over
with the first pattern in your hAWK program.
If a variable name is present immediately after “getline”, then the input record is
retrieved to the variable instead of to $0. The '<' symbol is the input redirection
operator meaning “get input from the file...”, and is followed by the name of the input
file to use. Note that file names must be full path names, as is always the case in hAWK.
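The three return values of “getline” are easiest to handle by saving the result in a
variable and testing it once the loop is over. The following is a sketch in standard
AWK, not tried under hAWK itself; CountLines, status, and the file name are
hypothetical. The redirected getline is wrapped in parentheses so that the comparison
cannot be mistaken for part of the file name expression.

```awk
# Count the lines in "file". Returns the count, or -1 if the file cannot be read.
function CountLines(file,    line, status, count)
{
    count = 0
    while ((status = (getline line < file)) > 0)
        ++count
    close(file)
    return (status < 0) ? -1 : count
}

BEGIN {
    # Hypothetical name, assumed not to exist, to show the error return.
    print CountLines("no_such_file_for_hawk_demo")     # -1
}
```

A loop written as “while (getline line < file)” with no comparison would never end
for a missing file, since -1 counts as true.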
Some examples:
$MFS_SuperLister uses a buffer holding a variable number of lines, to match regular
expressions that can span more than one line. The heart of this program is the action
{multi = $0;#the first line is already there
while (getline x > 0)#== 0 at end of file, < 0 for error
{
multi = multi "\n" x;
...
}
}
which employs a “getline” to retrieve the contents of the current input file from the
second line to the end of the file (the first line is already present in $0). This program
is discussed further in the “Beyond input records” section of “Advanced topics”.
$FilesInOrderTest illustrates the technique of reading in a list of input files, then setting
up the built–in variables so that those files will be used as input for a program. In other
words, the program receives a single input file which lists the actual input files to use;
this file is read at the start of the program, and used to set up the built–in array ARGV[]
so that the program will be “fooled” into taking input from the specified list of files.
The list of files is read in at the beginning with
BEGIN {while (getline _specific_file_ < ARGV[1] > 0)
{
if (length(_specific_file_) > 1 &&
index(_specific_file_, ":") > 0)
ARGV[ARGC++] = _specific_file_;
}
close(ARGV[1]);
ARGV[1] = "";
}
which reads in the full path names for the input files (one name per line) from the
first input file (ARGV[1]) into the variable “_specific_file_”. This program is
discussed further in the “Other ways of specifying input files” section of “Advanced
topics”.
----------------
The “hAWK” function
----------------
hAWK ( arr ) : executes the hAWK program specified by the array "arr", returns
the “recursive depth” at which the call was executed. The array holds the command–line
arguments to be passed to the new program, indexed 0,1,2.... The hAWK() function is a
recursive call to hAWK itself, with all built–in variables reset to their initial values.
“hAWK” can be called anywhere a function can be called (ie in an action or function, but
not a pattern). It’s just like calling hAWK from the menu, but you don’t get a dialog
so all arguments must be explicitly supplied. If the discussion below of what to put in
"arr" seems a bit brief, see also “The command line and ARGV[]”.
Each call to hAWK() does chew up some memory which is not freed until
all hAWK programs terminate, so there is some finite limit on the number of times that
hAWK() can be called. In addition, memory that your program allocates by creating
arrays is not automatically freed, so if the program called by hAWK() is not the last
thing that will be done then large arrays should be “emptied out” with something like
for (w in array)
delete array[w]
—this memory will then be available for other programs.
While hAWK() can be used to sequentially execute several small programs
(see $Chain), more typically it is used to execute just one program—a program
which is specially created by the calling program to do just the task required.
The primary advantage offered by calling another program from within a program
is that you can select, or even create, the program to be run after doing some
preliminary analysis (reading a file or looking at the preset variables), and the
program which is eventually run will be faster than a more general–purpose one.
$MFS_SuperReplace for example creates a special search–and–replace program
to do the s&r you specify with your “find” and “replace” variables, in which the
regular expression to search for is an explicit string rather than the content of
a variable (ditto the replace string). The advantage is that an explicit regular
expression is analyzed only once at the start of a program, whereas a variable
(dynamic) regular expression is re–analyzed every time it is used, even if its
contents don’t change. The special–purpose program takes a moment to get going,
but then runs noticeably faster than a general–purpose search–and–replace program
which uses variables.
The general incantation to follow for creating the command–line array "arr" is:
if (notFirstCall) #needed only if making more than one hAWK() call
{
x = 0; #arr[] is indexed 0 up - reset to 0 if making more than one call
for (w in arr)
delete arr[w]; #Avoid passing spurious arguments from last hAWK() call
}
arr[x++] = "hAWK"; #The command name in arr[0], anything you like, really.
arr[x++] = "-f" programName; #Full path name, eg
#programName = STDPATH "Drag_on Modules:hAWK programs:" "Type&Run program"
arr[x++] = "-f" FirstLibrary; #Full path name. The "-f" indicates a program name
...
arr[x++] = "-f" LastLibrary;
arr[x++] = "-v" "firstVar=" firstVar #Preset variables. "-v" indicates a variable
arr[x++] = "-v" "secondVar=73"; #Value can be hard-set too
...
arr[x++] = "-v" "lastVar=" lastVar
arr[x++] = "--" #Signals only input files, if anything, follow
arr[x++] = FirstInputFile #Full path name
...
arr[x++] = LastInputFile
notFirstCall = 1; #Needed only if making more than one call to hAWK()
depth = hAWK(arr); #invoke the program; returned value can be ignored.
If you wish to pass all input files along to the program being called, use
for (j = 1; j < ARGC; ++j)
arr[x++] = ARGV[j]
If you wish to use stdout as the input, use
arr[x++] = STDPATH "$tempStdOut"
For some real examples, see $Chain, $Type&Run, $RunClip, and $MFS_SuperReplace.
Note that no argument count “argc” needs to be passed to the hAWK() call; internally,
the end of arguments is detected by looking for 10 consecutive null arguments (eg if
arr[8] is non-null and arr[9] through [18] = "", then arr[8] is taken as the last
real argument).
A small bonus; when calling a hAWK program through the main dialog interface you
are limited to presetting at most 10 variables, but when using the hAWK() function
there is no limit on the number of variables you can preset.
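If you make several hAWK() calls, the incantation above can be wrapped in a function
that fills the array and returns the number of arguments. This is only a sketch under
stated assumptions: BuildArgs, the variable names, and the paths are all made up, the
argument layout simply follows the incantation above, and the code has been checked
against standard AWK only, where the hAWK() call itself does not exist and is
therefore left as a comment.

```awk
# Fill arr[0..n-1] with a command line for hAWK() and return n.
# vars[1..nVars] hold "name=value" strings; files[1..nFiles] hold full path names.
function BuildArgs(arr, program, vars, nVars, files, nFiles,    w, x, i)
{
    for (w in arr)
        delete arr[w]             # no leftovers from an earlier call
    x = 0
    arr[x++] = "hAWK"             # command name
    arr[x++] = "-f" program       # program to run, full path name
    for (i = 1; i <= nVars; ++i)
        arr[x++] = "-v" vars[i]   # preset variables
    arr[x++] = "--"               # only input files follow
    for (i = 1; i <= nFiles; ++i)
        arr[x++] = files[i]
    return x
}

BEGIN {
    vars[1] = "find=colour"; vars[2] = "replace=color"
    files[1] = "Disk:Docs:Chapter 1"
    n = BuildArgs(arr, "Disk:hAWK programs:MyProgram", vars, 2, files, 1)
    for (i = 0; i < n; ++i)
        print i ": " arr[i]
    # depth = hAWK(arr)   # the actual call, available only inside hAWK
}
```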
-------------
Advanced topics
-------------
“Advanced” is a bit pompous, really—you should have read through the above
material, tried out some of the supplied programs, and written a couple of
small programs yourself by this point. That’s all “advanced” means. And the
last section, “Calling hAWK through Minimal App”, is advanced only in terms
of understanding what’s going on behind the scenes. The instructions themselves
are easy to follow.
Other ways of specifying input
These techniques are useful when you need to run a hAWK program on several input files
taken in some specific order, or when you need to hard–code the name of an input file
into a program and process the contents of that file before or after all other
input files.
The way to persuade a hAWK program to treat input files in a specific order is to
prepare the list of files in the order required, and then modify the program to use
that list as the names of the input files. This requires building the list, and a small
addition to the program itself, but it’s not hard to do:
1 If possible, use your calling application to select the files for multi–file operations
(“searching”), and then run the hAWK program “$EchoFullPathNames”. hAWK uses
full path names to specify files, and this program will produce a list of the full path
names for the files you selected, in the window called “$tempStdOut”. You can
painfully construct full path names for your files by hand, but using this hAWK
program is the simpler way.
2 Arrange the full path names into your desired order, and if it’s a list you anticipate
using again, use “Save As” to save the list away permanently (the contents of
$tempStdOut don’t survive from one run to the next).
3 Copy this block of code to the top of your hAWK program, before all other code:
BEGIN {while (getline _specific_file_ < ARGV[1] > 0)
{
if (length(_specific_file_) > 1 && index(_specific_file_, ":") > 0)
ARGV[ARGC++] = _specific_file_;
}
close(ARGV[1]);
ARGV[1] = "";
}#end
This is executed before the rest of your program, and transparently converts the list of
input files in the array ARGV[] to the list provided in the one input file “ARGV[1]”
that is actually supplied when running it. The name of that one original input file is
nulled out, which persuades hAWK to ignore it when input processing starts for real.
4 When calling the hAWK program, select your list of files as the only input. If the
list is in the front window, pick “All of front text”, if it’s in a file use the
“Select input file…” option to select the file. Then run the program.
If you want to try this out in a test program, read through “$FilesInOrderTest”,
then run it and pass it a list of files. It will just print the list of files to $tempStdOut,
confirming that they were read in the correct order.
If you want your program to take input from some specific file first, and then take
input from whatever files are provided via the setup dialog, then you can pass your
program the name of the specific file by means of a variable and process the file in a
BEGIN block. Once again, the only real difficulty is to determine the full path name of
the file, and this can be done by using $EchoFullPathNames as described above, but
passing it the single file as input.
The method in full is:
1 Determine the full path name of the specific file, eg
Hard Disk:Top Folder:Bottom folder:theFile
2 Do the processing of this specific file in the BEGIN block of your program, in
the following way:
BEGIN { while (getline _x < _specific_file_ > 0)
{
-process _x, which contains the lines of _specific_file_
}
close(_specific_file_)
- optional other statements in your BEGIN block
}
3 While setting up your program for a run, use “Set variables” to provide the
full path name of the specific file in the variable _specific_file_:
_specific_file_=Hard Disk:Top Folder:Bottom folder:theFile
and then click “Save settings” if you will be using this file name more than once.
4 Run your program, using the setup dialog to take input from wherever is
appropriate. For an example, see “$WordFrequency”.
If you want to process a special file after all regular input, then use the same
structure as in point 2 above, but in an END block rather than a BEGIN block.
If the specific file is to be treated in exactly the same way as your other input files,
but must be processed first, then you can add this BEGIN block to the start of your
program, again using a fixed full path name passed in the variable “_specific_file_”:
BEGIN { for (i = ARGC; i >= 2; --i)#Note this creates ARGV[ARGC]
ARGV[i] = ARGV[i-1];
ARGV[1] = _specific_file_;
ARGC++;
}
Appending a specific input file is even easier, just
BEGIN { ARGV[ARGC++] = _specific_file_; }
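To see what these two BEGIN blocks do to the argument list, the same shuffling can be
tried on an ordinary array and printed. A sketch in standard AWK, not tried under
hAWK itself; PrependArg, AppendArg, and the short path names are hypothetical, and
args[] stands in for ARGV[] with n standing in for ARGC.

```awk
# Insert "file" at position 1 of args[0..n-1], shifting the rest up.
# Returns the new count.
function PrependArg(args, n, file,    i)
{
    for (i = n; i >= 2; --i)
        args[i] = args[i-1]
    args[1] = file
    return n + 1
}

# Add "file" after the last element. Returns the new count.
function AppendArg(args, n, file)
{
    args[n] = file
    return n + 1
}

BEGIN {
    args[0] = "hAWK"; args[1] = "Disk:A"; args[2] = "Disk:B"; n = 3
    n = PrependArg(args, n, "Disk:First")
    n = AppendArg(args, n, "Disk:Last")
    for (i = 0; i < n; ++i)
        print i, args[i]
}
```

The listing that comes out has “Disk:First” at index 1 and “Disk:Last” at index 4,
with the two original files in between and the command name untouched at index 0.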
You may find these techniques useful if your program needs a list of “data” before
running, in other words too much information to fit in the ten variables that you
can preset before each run.
The built–in variable STDPATH is a path name which specifies the folder that holds,
among other things, your “Drag_on Modules” folder, which in turn holds your “hAWK
programs” folder. If your specific input file is in the “hAWK programs” folder for
example, then you can avoid spelling out the full path name by using “Set variables” to
set “_specific_file_” to just the name of the file, eg
_specific_file_=Initial data file
and then before using _specific_file_ insert the line
_specific_file_ = STDPATH "Drag_on Modules:hAWK programs:" _specific_file_;
to build up the full path name for _specific_file_.
The above two methods can be blended together, for example to process an entire
list of files before dealing with other input files provided by the setup dialog, and
the files could be processed just as easily in an END block as in a BEGIN block.
Beyond input records
Let’s face it, not many text files are organized into neat lines or even groups of lines,
so it is often more appropriate to use hAWK’s automated record retrieval as just the
first stage of input, building functions on top of it to extract the precise input for the
job at hand. Four techniques are discussed below: “control–break”, which keeps track
of current input status by means of variables; “input on demand”, which buries the
problem of getting the next piece of input in a single function; end–buffered input,
which, if it reads in too much, temporarily stores the excess input to one side; and a
rolling buffer, which acts as a multiple–line “window” on the input, the number of
lines being variable at whim.
The “control–break” style of reading input wrestles with the problem that
you don’t know you’ve read in too much input until you’ve read in too much—what to
do then? The general solution is to use variables to keep track of what the current
“state” is (typically the states are “more input wanted” and “oops, a bit too much”).
This leads to control constructs which seem to put the cart before the horse, in that
one first takes action based on the value of a variable, and only later in the program is
the variable set, which requires a bit of planning.
As a simple illustration,
$1 != lastFieldOne { print "New field one is", $1
lastFieldOne = $1
}
which has been seen before, prints the contents of the first field on the input line
whenever it changes. The variable “lastFieldOne” is used to control output.
The general approach with control–breaks is, in pseudo–language:
if (toofar)
scramble to catch up;
else
proceed normally;
set the toofar variable;
At this point, you might want to read through an example of control–breaks: $XRef
deals with the problem of skipping over comments and strings in C code, even though
hAWK reads the input one line at a time and comments and strings can be anywhere.
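Here is the control–break pattern in a complete, if small, form: totals are reported
for each run of lines sharing the same first field, and the report for a run can only
be made once the first line of the next run has been read. This is a sketch in
standard AWK, not tried under hAWK itself. To keep it self–contained the “input” is a
string split into lines; SummarizeRuns and the sample data are made up.

```awk
# "text" holds newline-separated lines of the form "key value".
# Returns one "key=total" entry per run of equal keys, joined by ";".
function SummarizeRuns(text,    lines, n, i, f, last, total, result)
{
    n = split(text, lines, "\n")
    for (i = 1; i <= n; ++i) {
        split(lines[i], f, " ")
        if (i > 1 && f[1] != last) {            # too far: this line starts a new run
            result = result last "=" total ";"  # catch up on the run just finished
            total = 0
        }
        total += f[2]                           # proceed normally
        last = f[1]                             # set the control variable for next time
    }
    if (n > 0)
        result = result last "=" total          # the final run has no following line
    return result
}

BEGIN {
    print SummarizeRuns("a 1\na 2\nb 5\na 1")     # a=3;b=5;a=1
}
```

Note the extra step after the loop: the last run is never followed by a change of
key, so it has to be flushed separately. In a pattern–action program that step goes
in the END block.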
“Input on demand” is a way of using “getline” in combination with formatting functions
to retrieve input sequentially as though the entire file were one large record, without
cluttering up the top level of your program. The details of translating from line
format to your required format are buried in a function that keeps track of the
relation between the two; once this function is written, the top level of your
program can call this function without worrying about the translation details.
For a full example, see $Print_MENU_Resource, which deals with the problem of
reading and formatting a MENU resource, as retrieved by Read Resource.
End–buffered input relies on retrieving input lines through two functions,
“GetNextLine” and “UngetLine”, and a variable “inBuffer” which keeps track of
whether a line was “ungot”. With this approach there is no need to “scramble to catch
up”, since the extra input is stored to one side until the next “GetNextLine” call. The
conditions under which a line is to be stored due to going too far depend on the context
(ie it’s up to you), but the general approach is
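function DoTheJob(file)
{   getError = 1;
    while (GetNextLine(file) > 0)
        {
        if (you decide that’s too far)
            {
            UngetLine(theLine);
            return;#the stored line is handed back by the next GetNextLine call
            }
        else
            process theLine;
        }
}
and the functions that get and unget are
function GetNextLine(file)
{
    if (getError <= 0)
        return getError;
    if (inBuffer)
        {
        theLine = _buffer;
        inBuffer = 0;
        return 1
        }
    return getError = (getline theLine < file)
}
function UngetLine(line)
{
    _buffer = line
    inBuffer = 1
}
where “file” is the full path name of the file to take input from. The line itself is
returned in the global variable “theLine” rather than through a parameter, because
scalar parameters are passed by value and an assignment made to one inside
GetNextLine would never reach the caller.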
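A runnable variant of end–buffered input is sketched below, written for standard AWK
and not tried under hAWK itself. It reads a file made of groups, each starting with a
header line beginning with '#', and hands back one group per call. The assumptions
are loud ones: the current line is kept in a global variable called theLine (scalar
parameters being passed by value), ReadGroup and the scratch file name are made up,
and the file is created by the program itself so the example needs no other input.

```awk
function GetNextLine(file)
{
    if (getError <= 0)
        return getError
    if (inBuffer) {                # hand back the line that was ungot
        theLine = _buffer
        inBuffer = 0
        return 1
    }
    return getError = (getline theLine < file)
}

function UngetLine(line)
{
    _buffer = line
    inBuffer = 1
}

# Return the next group as "header,line,line", or "" when no input is left.
function ReadGroup(file,    group)
{
    group = ""
    while (GetNextLine(file) > 0) {
        if (theLine ~ /^#/ && group != "") {   # too far: next group's header
            UngetLine(theLine)
            return group
        }
        group = (group == "") ? theLine : group "," theLine
    }
    return group
}

BEGIN {
    file = "hawk_unget_demo.tmp"               # hypothetical scratch file
    print "#one\na\nb\n#two\nc" > file
    close(file)
    getError = 1
    while ((g = ReadGroup(file)) != "")
        print g                                # "#one,a,b" then "#two,c"
    close(file)
}
```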
For an example using end–buffered input, see “The AWK programming language” by
Aho, Kernighan, and Weinberger, page 105. You’ll find this approach useful if you
have small databases to analyse.
The rolling–buffer approach to input adds lines of input to the end of a variable, and
removes them from the front. The variable in question can contain more or fewer lines
according to the needs of the moment, though there should be an upper limit on the number
of lines. In pseudo–language, the general approach to rolling lines of input through a
buffer variable is:
while (getline x > 0)
{
multi = multi "\n" x;#add current line x to end of buffer variable “multi”
process multi however you like;
while (too many lines in multi)
{
j = index(multi, "\n");#position of first newline in multi
#first line in multi, if needed, = substr(multi, 1, j);
multi = substr(multi, j + 1);#trim first line from multi
}
}
The “while (getline x > 0)” loop stops normally when the end of the current input file
is reached (abnormal, as in file missing, is possible but unlikely). You can count
the number of lines in multi at any time with
numMultiLines = gsub("\n", "\n", multi)
which replaces newlines with newlines, and relies on gsub returning the number of
replacements—awkward, but it works. Arbitrary chunks of text can be removed from
the front of multi if desired, rather than removing a line at a time.
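The rolling buffer can be tried out in a self–contained form as follows. This is a
sketch in standard AWK, not tried under hAWK itself; Roll and LineCount are made-up
names, the “input” is a list of words standing in for lines, and the buffer is built
without a leading newline, so the line count is the newline count plus one. Lines are
assumed to be non-empty.

```awk
# Number of lines in s (newline-separated, no trailing newline).
function LineCount(s,    t)
{
    if (s == "")
        return 0
    t = s                                  # count on a copy, leaving s alone
    return gsub(/\n/, "\n", t) + 1
}

# Append line x to the buffer, then drop lines from the front
# until at most maxLines remain. Returns the new buffer.
function Roll(multi, x, maxLines)
{
    multi = (multi == "") ? x : multi "\n" x
    while (LineCount(multi) > maxLines)
        multi = substr(multi, index(multi, "\n") + 1)
    return multi
}

BEGIN {
    n = split("one two three four five", words, " ")
    for (i = 1; i <= n; ++i) {
        buf = Roll(buf, words[i], 3)
        print i ": " LineCount(buf) " line(s) in the buffer"
    }
}
```

After the fifth word the buffer holds “three”, “four”, and “five”, the first two
words having rolled off the front.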
For a full and very useful example see “$MFS_SuperLister” which is capable
of matching a regular expression or string of text even if it spans a variable number
of lines. “$MFS_SuperReplace” is similar, doing multi–file search and replace instead
of just listing matches.
Calling hAWK through Minimal App
Minimal App does not support passing text or file lists to hAWK, or showing results
after a run, but these things can be done with a bit of extra work on your part. If
you’re not interested in using Minimal App or some other application that provides
minimal support for hAWK as your main hAWK–caller, you can skip this section.
Since Minimal App does not support text documents at all, you’ll need an editor of some
sort in order to do these things, and the assumption here will be that you’re running
under MultiFinder (or system 7), using your favourite editor. You could also use a
Desk Accessory editor together with Minimal App, a practical alternative if you intend
to do nothing but run hAWK programs for an extended period. However, the focus here is
on running hAWK programs while using an editor that does not support calling hAWK,
by using Minimal App, MultiFinder, and a few workarounds.
Ideally, an editor designed to run under MultiFinder should offer you protection against
creating multiple versions of a file, and provide some automatic means of ensuring that
you are always viewing the most up-to-date version of a file. An adequate solution in a
single–user context would be for all editors to cooperate by offering the options of
automatically saving all open files when switching out, and refreshing all open files
from disk (if necessary) when switching back. At present almost all Macintosh editors
are, in this sense, MultiFinder–unaware. So unless you know otherwise, it’s up to you
to ensure that you keep the screen and disk versions of a file synchronised by Saving and
Reverting with your editor at the appropriate times, as described below. Nuisance, what?
First, let’s look at passing all or part of a file to hAWK, and viewing the result of a run.
Since hAWK provides as your input option just the ability to select a single file when
called through Minimal App, the simplest approach is to use a single common file as the
input file for all programs which expect input from all or part of a file, and use the
setup dialog to set (and save) that file as the input file. Oddly, the simplest file to pick
is stdout ($tempStdOut, in the same folder that holds Minimal App). There is no
conflict between passing stdout to a program as input, and then writing to stdout,
because just before your program is run hAWK will rename your stdout file to
“$tempOutAsInput” and then pass that name to your program. The “old” version
of stdout will be used as input, and the “new” version will hold whatever was written
to stdout during the run. With stdout as your common input file, the approach to use
for passing all or part of a file from your editor to a hAWK program is:
• Open the stdout file (ie $tempStdOut) in your editor, and leave it open (you can create
this file by running $EnumSwitch with no input, or create it with your editor - it goes
in the same folder as Minimal App and the Drag_on Modules folder, at the same level)
• Copy/Paste the input text over all of stdout, and Save it.
• Switch to Minimal App, call up hAWK, and select your program.
• If it’s the very first run, use the “Select input file...” command to select $tempStdOut
as the specific input file, and then Save Settings so the program will remember this.
• Run the program.
• Return to your editor, type a character in the stdout window, and Revert - you’ll see
what was written to stdout by the program. To view any other created or altered files,
you’ll need to open them with your editor.
Here’s an example run, to get you going. The example program is $EnumSwitch, which
takes a list of enum constants and generates a “switch” statement based on them. You
should be viewing this file with your editor, and also have Minimal App up and running
in a separate partition under MultiFinder or System 7.
• Copy the indented line just below with your editor, and Save it as the entire contents
of $tempStdOut, in the same folder where you’re keeping Minimal App.
    {first, second, third, fourth, twilightZone = -99}
• Leave the $tempStdOut file open.
• Switch to Minimal App and select hAWK; use the “Main program:” popup menu
to select “$EnumSwitch” as the program to run.
• Use the “Select input file...” option under the “Take input from:” popup menu
to select your “$tempStdOut” file as the input file to use with $EnumSwitch.
• Click the “Save settings” button so that $EnumSwitch will remember which
input file to use for subsequent runs.
• Click the Run button, and wait until the highlighting goes away from the main
menu bar, signalling that the program is done.
• Return to your editor, type a character in the $tempStdOut window, and pick
Revert; you’ll see the results of $EnumSwitch on the line of enums you started with.
Some programs, such as $MFS_SuperReplace, naturally work with a list of files rather
than just a single file. Here the simplest approach is to pass to your program a single
input file which contains a list of the actual files to use as input. Again, it is best to
settle on a single name for the file which contains the file list, and use the setup dialog
to set the program to take input from this file. Here the name doesn’t matter, and
something like “Standard File List” would do ($tempStdOut and other standard files
are best avoided here). It then remains to create the list of files, and to alter
the program(s) internally so that they will properly interpret the file list.
First, the list of files: it should be a list of full path names, one file per line. You can
generate the full path name for any single file by running “$EchoFullPathNames” with
the file in question as input. Given that path, you can then generate full path names for
other files in the same folder with a bit of copying and replacing of the file name,
leaving the path the same. Some editors can generate full path names for files, which is
an easier approach. If you have no easy way of generating full path names you might
want to create a “master list” of full path names, and selectively copy the needed names
to your “Standard File List” file before running a hAWK program.
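The copy-and-replace step described above (keep the path, swap the file name) can be sketched in C. This is only an illustration: the function name sibling_full_path and its parameters are hypothetical and are not part of hAWK or its sources. It assumes the colon-separated full path names that the file list uses.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of hAWK): given the full path name of one
   file, build the full path name of another file in the same folder.
   Returns 0 on success, -1 if knownPath has no colon or out is too small. */
int sibling_full_path(const char *knownPath, const char *newName,
                      char *out, size_t cap)
{
    const char *lastColon = strrchr(knownPath, ':');
    size_t folderLen, nameLen;

    if (lastColon == NULL)
        return -1;                      /* not a full path name */

    folderLen = (size_t)(lastColon - knownPath) + 1;  /* keep the trailing colon */
    nameLen = strlen(newName);
    if (folderLen + nameLen + 1 > cap)
        return -1;                      /* result would not fit */

    memcpy(out, knownPath, folderLen);              /* the folder part, unchanged */
    memcpy(out + folderLen, newName, nameLen + 1);  /* new file name plus the NUL */
    return 0;
}
```

Given the path reported by $EchoFullPathNames for one file, a call per sibling file name would produce the lines to paste into your “Standard File List” file.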
Each program that you want to take input from your file list needs a small addition
at the beginning. Open the program, and copy the following BEGIN block into the
program, as the very first block of code in the file:
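    BEGIN {
        # ARGV[1] names the file list; add each full path name in it as an input file
        while ((getline _specific_file_ < ARGV[1]) > 0)
            {
            if (length(_specific_file_) > 1 && index(_specific_file_, ":") > 0)
                ARGV[ARGC++] = _specific_file_;
            }
        close(ARGV[1]);
        ARGV[1] = "";    # so the list itself is not read as input
        }#end addition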
This persuades the program to take input from the list of files, rather than treating
the list of files as the input. This may look familiar, as it’s the same alteration
described in the first section of this chapter for persuading a program to take input
from a list of files in specific order.
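The test that the BEGIN block applies to each line of the file list can be restated in C, as a rough sketch only. The function name looks_like_full_path is hypothetical. It mirrors the two conditions in the block: the line must be longer than one character and must contain a colon, so blank lines and bare file names are skipped.

```c
#include <string.h>

/* Hypothetical restatement of the BEGIN block's filter (not hAWK source):
   accept a line only if it is longer than one character and contains ':'. */
int looks_like_full_path(const char *line)
{
    return strlen(line) > 1 && strchr(line, ':') != NULL;
}
```

A line such as “file.c” fails this test, which is why the file list must hold full path names rather than simple file names.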
And finally, to run a hAWK program on a list of files:
• Your “Standard File List” file should contain the exact list of files that you want to
use as input files, as full path names. Remember to Save it if you change it, before
running your program.
• Switch to Minimal App, call up hAWK, and select the program to be run.
• If it’s the very first run, use the “Select input file...” command to select your file
containing the file list as the specific input file, and then Save Settings so the program
will remember this.
• Run the program.
• Back to your editor, and Revert stdout as described above if the program writes to
stdout.
---------------------------
Calling hAWK from your application
---------------------------
What and how
Your application, that is, any application for which you have the source code, should
be a THINK C project. If your application is written for some other C compiler, you
should be able to modify the supplied source without too much anguish. If your application
is not written in C you will still be able to call hAWK if your language supports calling
C–style functions. However, you will have to provide your own equivalent for the
file “Call_Resource.c”, not a trivial undertaking. The following discussion
will assume that your application is built from a THINK C project.
Drag_on Modules, of which hAWK is an example, are CODE resources. To call a
Drag_on Module, you load the first segment of its code (CODE 0), set up a pointer
to an interface structure which contains file names and “callback” functions, and
then jump to the starting address of the CODE resource as though it were a C–style
function. Your application will load a list of Drag_on Modules into a menu for selection
by the user.
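The general shape of such a call can be sketched in portable C. Everything below is an assumption made for illustration: the structure layout, the field names and the functions call_module and fake_module are hypothetical, and the real interface structure and the resource loading are in “Call_Resource.c”. Only the idea is taken from the description above: fill in a structure holding file names and callbacks, then call the module's entry point as though it were an ordinary C function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical interface structure; the real one is defined in Call_Resource.c. */
typedef struct {
    const char *stdoutName;                /* file the module writes results to */
    const char *commandLine;               /* optional command line, or NULL */
    void (*okStopAlert)(const char *msg);  /* callback: show a message to the user */
} ModuleInterface;

/* The module's starting address, treated as a C-style function. */
typedef int (*ModuleEntry)(ModuleInterface *iface);

static char lastAlert[64];

/* Stand-in callback: remembers the message instead of showing an alert. */
static void record_alert(const char *msg)
{
    strncpy(lastAlert, msg, sizeof lastAlert - 1);
    lastAlert[sizeof lastAlert - 1] = '\0';
}

/* Stand-in for loaded module code; a real module would run a program here. */
static int fake_module(ModuleInterface *iface)
{
    if (iface->commandLine == NULL) {
        iface->okStopAlert("no command line");
        return 1;
    }
    return 0;
}

/* Call the module through its entry point, passing the interface pointer. */
int call_module(ModuleEntry entry, ModuleInterface *iface)
{
    if (entry == NULL || iface == NULL)
        return -1;
    return entry(iface);
}
```

In this sketch the module reports back through the okStopAlert callback rather than touching the caller's user interface directly, which is the reason for passing callbacks in the interface structure at all.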
Modifying your application to call hAWK and other Drag_on Modules divides into two
stages: adding the source file “Call_Resource.c” to your project and inserting two
function calls in your source; and then, when the basic version has checked out,
deciding what level of support to supply for callback and result–showing functions.
Drag_on Modules can be called by virtually any application, but considerable enhancement
is possible if your application supports text windows and files. For example, hAWK can
take input from the front text window of your application, and relies on your application
to show the text file stdout if the user requests it. If your application doesn’t support text
windows and files it can still call hAWK, but some input options and the showing of
result files will be absent.
Getting started
To get going, add the source file “Call_Resource.c”, in the “code to call Drag_ons”
folder on the same disk where you found this manual, to your application project. You
will also need to add the standard ANSI library if it’s not already in your project (this
won’t add much to the size of your built application). Compile it, and run it as well to
check for linkage errors. If your application lacks some of the toolbox headers that are
normally included in the MacHeaders precompiled standard header then you may have to
explicitly #include them in the file “Call_Resource.c”.
Add two calls in your code
First, decide which of your application menus to use for showing the Drag_on Modules.
Then follow the instructions at the top of “Call_Resource.c” in points 2 and 3 which
describe how and where to place the two calls to functions in “Call_Resource.c”.
InitCallResources() will load a list of Drag_on Modules into your chosen menu, and
CallResource() will call a Drag_on Module when it is selected from your menu.
For an example of adding “Call_Resource.c” to an application and inserting the two
required function calls, see the source code and THINK C project for “Minimal App”
(the two calls are in “minimalApp.c”, and the copy of “Call_Resource.c” in the
“Minimal App” folder is identical to the original in “code to call Drag_ons”).
A minimal version
Verify that line 98 or so of “Call_Resource.c” reads
#define SUPPORT_LEVEL MINIMAL
Bring your THINK C project up to date, and build a new version of your application. In
order for hAWK and company to show up in your menu, the folder “Drag_on Modules”
(with hAWK inside) needs to be in the same folder as your application, at the same
level, so do this first before starting up your application.
Start your application, and you should see hAWK listed under the menu you have chosen
to show Drag_on Modules. Select hAWK, and the setup dialog should appear; however,
input options under the “Take input from:” popup will be limited to just the
“Specific input file...” option. Select the program “$EchoFileNames”, and then use
the “Take input from:” option to select any TEXT file for it to use as input. Click
Run, wait about 2 seconds or until the mouse is back under your control, and then
check however you like that the file “$tempStdOut” contains the name of the file you
selected as input for “$EchoFileNames”.
Callbacks, and showing results
Once you have the above basic version up and running, you should read through the
“Call_Resource.c” file and decide how much support to provide for the tasks of
offering input options and showing the “$tempStdOut” result file. An important
and easily–supported alert function (OKStopAlert) and a function for changing the
cursor to a watch round out the list of functions that enhance the performance of
hAWK (or of any Drag_on Module, for that matter). The more functions
you support, the more useful hAWK will be to your users.
If you decide to support any of these optional capabilities, also change the
#define SUPPORT_LEVEL MINIMAL
statement in “Call_Resource.c” to reflect the level of support you are providing
(instructions for this are in the file, around line 86).
Finally, at or shortly after line 131 in “Call_Resource.c” you will see the statement
static char callerName[] = "MyApp";
Change the name to the name of your application, and you’re done.
Any enhancements or modifications you make are your own business. However, hAWK
and most of the source code for hAWK are copyright by the Free Software
Foundation: you can distribute hAWK and the source code for it, provided you follow the
restrictions contained in the file “COPYING hAWK”, on the same disk where you found this
manual. Where Dynabyte (Ken Earle) might be construed as owning the copyright, all rights are
waived except the right to copyright, this latter only to preserve the former. Catch 23.
Using a command line
The last parameter to CallResource() is a pointer to an optional text command line. If
this is not NULL, then the command line will be used to invoke the program specified by
the command line, with no dialog shown. There are two things to do to make this work
with your application:
• construct a proper command line for hAWK
• put something in your user interface to let your users call hAWK with the command line.
This is the format of a hAWK command line (note it can cover several lines):
    hAWK -f"Program Name" -f"Library Name"
        -s -ss -n
        -vVariableName="some value" -- MFS "InputFullPathname"
• the entire command line should be a C string (null terminated)
• the command line text must begin with "hAWK" followed by a space or tab
• there must be one program name, as signalled by -f. If you just supply a simple program name,
it must reside in the "hAWK programs" folder. Use a full path name if the program is in some other
folder. If the program name (or any part of the full path name) contains a space, then put quotes ""
around the full name, otherwise the quotes are not needed.
• library names are written the same way as the program name, and are optional. Since library names
look the same as the program name, the first -f name seen is taken as the program name.
• variables are signalled by the -v option, eg -vmyName="Ken E" or -vlevel=1
where the quotes "" are optional if the value contains no spaces or tabs. Spaces before the '=' sign
are optional, but don't put any between the '=' and the actual value. Variables are optional. In
particular, any variable settings that have been saved with the program (by using the setup
dialog) will automatically be passed along with the command line, and so you should set these
variables on the command line only if you want to override the default saved values (to see
those, select the program in the setup dialog and click the "Set variables..." button).
• "--" signals that input files only follow. This is optional, mainly to make reading easier.
• "MFS" stands for "all files currently selected for multi-file operations", an input option
that must be implemented by the calling application. This one is optional.
• input file names are optional, and should be provided as full path names. If any part of the full
name contains a space then the quotes "" are necessary, otherwise they're optional.
You may also optionally use the following output options in the command line (place them before
any "--"):
• -s means show stdout when done
• -ss means show and select stdout when done
• -n means no showing of stdout when done.
If you don't provide an output option, any output option from the settings saved with the program
will be used instead (these correspond to the "Show/select stdout" checkboxes in the setup dialog).
Any output option you do provide overrides the saved settings.
You may supply both "MFS" and one or more specific input files on a command line, and unlike the
dialog approach you may supply any number of variables (the dialog is limited to 10).
As far as the interface goes, pressing <enter> or <command><return> to fire off a command line
is reasonably standard (you may also require the entire command line to be selected, depending
on how confusing things would be otherwise).
Some example command lines:
    hAWK -f$EchoFullPathNames -- MFS
    hAWK -f$BoilerPlate -vputInComment=1 -vfile="@.c" -vauthor="KE" -vcompany="bdibdi" -ss
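As a sketch of the first task (constructing a proper command line), the C code below assembles a command line following the rules listed above: it begins with "hAWK" and a space, names the program with -f, and puts quotes around any name or value that contains a space or tab. The function build_hawk_command and its helpers are hypothetical and are not part of “Call_Resource.c”. For brevity the sketch handles one output option, one variable and one input name.

```c
#include <stddef.h>
#include <string.h>

/* Append text to the C string in buf (capacity cap).
   Returns 0 on success, -1 if the result would not fit. */
static int append(char *buf, size_t cap, const char *text)
{
    size_t used = strlen(buf), add = strlen(text);
    if (used + add + 1 > cap)
        return -1;
    memcpy(buf + used, text, add + 1);
    return 0;
}

/* Append a name or value, quoting it only if it contains a space or tab. */
static int append_quoted(char *buf, size_t cap, const char *text)
{
    const char *q = (strpbrk(text, " \t") != NULL) ? "\"" : "";
    if (append(buf, cap, q) || append(buf, cap, text) || append(buf, cap, q))
        return -1;
    return 0;
}

/* Hypothetical helper: build a null-terminated hAWK command line in buf.
   outputOption is "-s", "-ss", "-n" or NULL; varName/varValue may be NULL;
   inputPath is a full path name, "MFS", or NULL.
   Returns 0 on success, -1 if buf is too small (buf contents are then partial). */
int build_hawk_command(char *buf, size_t cap, const char *program,
                       const char *outputOption,
                       const char *varName, const char *varValue,
                       const char *inputPath)
{
    if (cap == 0)
        return -1;
    buf[0] = '\0';
    if (append(buf, cap, "hAWK -f") || append_quoted(buf, cap, program))
        return -1;
    if (outputOption != NULL) {
        if (append(buf, cap, " ") || append(buf, cap, outputOption))
            return -1;
    }
    if (varName != NULL && varValue != NULL) {
        if (append(buf, cap, " -v") || append(buf, cap, varName) ||
            append(buf, cap, "=") || append_quoted(buf, cap, varValue))
            return -1;
    }
    if (inputPath != NULL) {
        if (append(buf, cap, " -- ") || append_quoted(buf, cap, inputPath))
            return -1;
    }
    return 0;
}
```

The resulting string would then be passed as the last parameter to CallResource(); the other parameters of that function are described in “Call_Resource.c” and are not shown here.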
-------------
Modifying hAWK
-------------
Introduction
Building hAWK used to be a nontrivial undertaking. Now, just build the "hAWK.µ" CodeWarrior
project, merging it into an existing copy of "hAWK" when the merge dialog appears.
At present, CodeWarrior ANSI libraries suffer from the problem that they allocate
a 65K pointer and never let go of it, but this is worked around by throwing hAWK
into its own heap zone when calling it, then dumping the whole heap when done.
Warning: the original PC code that hAWK is based on is old, very old,
and the modifications to make it Macintosh were rather brutally done. If you plan
major changes to hAWK, expect some grief along the way.
END hAWK MANUAL
(OOPS forgot to provide the Reverse Polish expression interpreter - what a tragedy...)
-------------------
Active index
-------------------
This index lists line numbers for topics, suitable for use with editors that
allow you to jump to or “Go to” a selected line number.
| in reg. exp. 1680
|| in patterns 1857
~ (matching operator) 1600
!~ (not match operator) 1639
π 2100
\ in reg. exp. 1680
\1...\9 1706
\< 1692
\> 1693
\B 1695
\b 1694
\n 1697
\t 1696
\W 1691
\w 1690
! in patterns 1857
$about the supplied programs 839
$tempStdIn 564 740
$tempStdErr 740
$tempStdOut 692 740 752
$tempStdOut is temporary 774
$ to start program name 525
$ in reg. exp. 1680
$EnumSwitch 401 1041
$FilesInOrderTest 2771
$MFS_SuperLister 2758
$PatternTester 1926
$sample programs see 839
&& in patterns 1857
( ) in reg. exp. 1680
* in reg. exp. 1680
+ in reg. exp. 1680
. in reg. exp. 1680
>, >> (redirection) 2610
? : in patterns 1873
? in reg. exp. 1680
[ ] in reg. exp. 1680
^ in reg. exp. 1680
actions 1947
All of front text 561
ANSI a4 3334
ARGC 1157 1325
ARGV[] 1147 1325
arrays 1439
atan2() 2082
automatic conversion 1405
auto version incrementing 2661
AWK and GAWK 268
backslash to break long lines 1098
beep() 2202
BEGIN (pattern) 1556
break 2351
breaking lines 1090
built–in string and file functions 2109
built–in variables 1325
built–in numeric functions 2082
Call_Resource.c 3242
calling hAWK from your application 3210
cancelling a run 735
close() 2682
command line 1147
comments in the source 1013 1115
comparison operators in patterns 1586
compound patterns 1856
concurrent and immediate modes 460
continue 2355
control–break 2989
control-flow statements 2311
concatenation 2010
constants 1247
conversion, numbers and strings 1405
copy() 2204
cos() 2082
delete 1472
do-while statement 2333
empty statements 2444
END (pattern) 1556
end–buffered input 3025
example hAWK programs 839
exists 2214
exit 2364
exp() 2082
expression operators 2024
expressions (as patterns) 1576
expressions in actions 1967
fields ($1 $2 etc) 1028 1280
fdate() 2216
FILENAME 1325
files, closing 2682
FNR 1325
for (var in array) 1471 2348
for (;;) statement 2338
Front text selection 560
FS (field separator) 1292 1325 2701
fsize() 2221
full path name, splitting 1543 2656
full path names 1222 1366 1545 2626
functions, user–defined 2451
function, local variables 1374
GAWK and AWK 260
getclip() 2222
getline 2731
grouping and breaking lines 1090
gsub() 2109
hAWK programs (folder) 525
hAWK, calling from your application 3210
hAWK, installing 191
hAWK() function 2792
if statement 2324
IGNORECASE 1325
immediate and concurrent modes 460
int() 2082
in (operator) 1459
index() 2109
input files, in order 2886
input on demand 3015
input selection for a program 536
installing hAWK 191
length() 2109
library files 666
lines, breaking and grouping 1090
list() 2234
local variables 1374 2484
log() 2082
lookup() 2141
Main program: (popup) 525
match() 2109
metacharacters 1680
MFS selected files 570
Minimal App 3258
minimalApp.c 3259
missing pattern 1537
modifying hAWK 3305
multiline records 2718
name conventions for programs 525
nested() 2239
next 2360
NF 1325
no input, specifying 590
NR 1325
null string 1267
number versus string 1405
numeric functions, built–in 2082
octal in reg. exp. 1713
OFMT 1325
OFS 1325
tolower() 2109
operators, table of 2033
ordering input files 2886
ORS 1325
output into files 2610
patterns and actions 1527
path names 1222 1366 1545 2626
patterns 1525
pattern, missing 1537
patterns, summary 1905
pipes (none) 287
presetting variables 598
print (preview of) 1990
print (details) 2511
printf statement 2536
printing this manual 232
program name conventions 525
program, input selection 536
prompt() 2174
punctuation, inside / / 1623
punctuation, inside quotes 1630
putclip() 2228
rand() 2082
range patterns 1881
records ($0) 1028 1280
redirecting output 2610
references 244
regular expressions 1644
regular expressions, examples 1752
remove() 2247
rename() 2251
return 2460
RLENGTH 1325
rolling buffer for input 3066
RS (record separator) 1325 1285
RSTART 1325
Run button 452 727
RUNERR 1325
sample hAWK programs 839
Save settings (button) 711
setup dialog 430
setup, saving 711
setting variables before a run 598 1225 1394
Selecting input for a program 536
Select all of stdout (checkbox) 706
Select input file… 582
Select unlisted program… 528
Show stdout (checkbox) 699
sin() 2082
sort() 2156
SortLibrary, sample library 685
specific order for input files 2886
split() 2109
split full path name 1543 2656
sprintf() (see also printf) 2549 2109
sqrt() 2082
srand() 2082
STDPATH 1325 2632 2964
standard input and output 740
statement grouping with {} 2321
stderr 2643
stdout 2643
string functions, built–in 2109
string-matching patterns 1600
string versus number 1405
sub() 2109
substr() 2109
SUBSEP 1325 1451
summary of patterns 1905
supplied hAWK programs 839
system 289
Take input from: (popup) 560
TIME builtin variable 1372
time() 2170
toupper() 2109
uninitialized variables 1267
unix a4 library 3334
user-defined functions 2451
variables 1247
variable, setting before a run 598 1225 1394
version incrementing 2661 (see also $TabsToSpaces)
while statement 2327